STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training

Weihong Zhong; Mao Zheng; Duyu Tang; Xuan Luo; Heng Gong; Xiaocheng Feng; Bing Qin

Large-scale video-language pre-training models, which usually build a global alignment between the video and the text, have achieved remarkable progress on various downstream tasks, but the idea of adopting fine-grained information during the pre-training stage is not well explored. In this work, we propose STOA-VLP, a pre-training framework that jointly models object and action information across spatial and temporal dimensions. More specifically, the model regards object trajectories across frames and multiple action features from the video as fine-grained features. In addition, we design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model. The first is the dynamic object-text alignment task, which builds a better connection between object trajectories and the relevant noun tokens. The second is the spatial-temporal action set prediction task, which guides the model to generate consistent action features by predicting the actions found in the text. Extensive experiments on three downstream tasks (video captioning, text-video retrieval, and video question answering) demonstrate the effectiveness of the proposed STOA-VLP, e.g., a 3.7 Rouge-L improvement on the MSR-VTT video captioning benchmark and a 2.9% accuracy improvement on the MSVD video question answering benchmark, compared to previous approaches.
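For readers unfamiliar with the captioning metric cited above, the following is a minimal sketch of sentence-level Rouge-L, the F-measure over the longest common subsequence (LCS) of a candidate and a reference caption. It is not the authors' evaluation code: the whitespace tokenization, lowercasing, the default `beta=1.0`, and the function names `lcs_length` and `rouge_l` are all assumptions made for illustration. Captioning benchmarks typically report this score multiplied by 100, so the reported gain would correspond to points on that scale.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level Rouge-L F-measure in [0, 1].

    Assumptions (not from the paper): lowercased whitespace tokens and a
    single reference; beta weights recall relative to precision.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    score = rouge_l("a man is playing a guitar", "a man plays the guitar")
    print(f"Rouge-L: {score:.4f}")
```

In the usage example, the LCS is "a man guitar" (3 tokens), giving precision 3/6 and recall 3/5. Published results usually come from a standard evaluation toolkit with multiple references per video, so scores from this sketch are not directly comparable to the reported numbers.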

updated: Mon Feb 20 2023 03:13:45 GMT+0000 (UTC)

published: Mon Feb 20 2023 03:13:45 GMT+0000 (UTC)

arXiv
