End-to-End Video Captioning

Silvio Olivastri; Gurkirt Singh; Fabio Cuzzolin

Building correspondences across different modalities, such as video and language, has recently become critical in many visual recognition applications, such as video captioning. Inspired by machine translation, recent models tackle this task using an encoder-decoder strategy. The (video) encoder is traditionally a Convolutional Neural Network (CNN), while decoding (for language generation) is done using a Recurrent Neural Network (RNN). Current state-of-the-art methods, however, train the encoder and the decoder separately. CNNs are pre-trained on object and/or action recognition tasks and used to encode video-level features. The decoder is then optimised on these static features to generate the video's description. This disjoint setup is arguably sub-optimal for mapping the input (video) to the output (description). In this work, we propose to optimise both encoder and decoder simultaneously in an end-to-end fashion. Training proceeds in two stages: we first initialise our architecture using pre-trained encoders and decoders; the entire network is then trained end-to-end in a fine-tuning stage to learn the features most relevant to video caption generation. In our experiments, we use GoogLeNet and Inception-ResNet-v2 as encoders and an original Soft-Attention (SA-) LSTM as the decoder. Analogously to gains observed in other computer vision problems, we show that end-to-end training significantly improves over the traditional, disjoint training process. We evaluate our End-to-End Networks (EtENet) on the Microsoft Research Video Description (MSVD) and the MSR Video to Text (MSR-VTT) benchmark datasets, showing how EtENet achieves state-of-the-art performance across the board.
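The difference between disjoint and end-to-end training can be illustrated with a deliberately tiny sketch. It is not the authors' implementation: the "encoder" and "decoder" here are single scalar weights standing in for the CNN and the SA-LSTM, and the data, learning rate, step counts, and all names (`train`, `loss`, `update_encoder`) are assumptions made for illustration only. It shows just the training schedule: a first stage in which the encoder is frozen and the decoder is fitted to its static features, followed by a second stage in which gradients update both parts.

```python
def train(w_enc, w_dec, data, lr, steps, update_encoder):
    """Gradient descent on mean squared error for the toy model
    prediction = w_dec * (w_enc * x). Hypothetical stand-in for CNN + RNN."""
    n = len(data)
    for _ in range(steps):
        g_enc = 0.0
        g_dec = 0.0
        for x, target in data:
            feat = w_enc * x               # "encoder" output (the feature)
            err = w_dec * feat - target    # "decoder" output minus target
            g_dec += 2.0 * err * feat      # d(err^2)/d(w_dec)
            g_enc += 2.0 * err * w_dec * x # d(err^2)/d(w_enc), via the chain rule
        w_dec -= lr * g_dec / n
        if update_encoder:                 # skipped while the encoder is frozen
            w_enc -= lr * g_enc / n
    return w_enc, w_dec


def loss(w_enc, w_dec, data):
    """Mean squared error of the toy model on the data."""
    return sum((w_dec * w_enc * x - t) ** 2 for x, t in data) / len(data)


# Toy data (assumed): targets follow t = 3 * x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

# Stand-ins for pre-trained weights (assumed values).
w_enc0, w_dec0 = 1.0, 0.5

# Stage 1: disjoint setup, the encoder is frozen and only the decoder is optimised.
w_enc1, w_dec1 = train(w_enc0, w_dec0, data, lr=0.01, steps=5, update_encoder=False)

# Stage 2: end-to-end fine-tuning, gradients update encoder and decoder together.
w_enc2, w_dec2 = train(w_enc1, w_dec1, data, lr=0.01, steps=5, update_encoder=True)

print("initial loss:", loss(w_enc0, w_dec0, data))
print("after stage 1:", loss(w_enc1, w_dec1, data))
print("after stage 2:", loss(w_enc2, w_dec2, data))
```

In this toy setting the encoder weight is untouched by stage 1 and only starts adapting to the target task in stage 2, which mirrors the schedule described in the abstract; the sketch says nothing about how large the gain is for real video captioning models.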

updated: Fri Nov 08 2019 10:28:48 GMT+0000 (UTC)

published: Thu Apr 04 2019 15:57:23 GMT+0000 (UTC)

arXiv
