Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

Yuchong Sun; Hongwei Xue; Ruihua Song; Bei Liu; Huan Yang; Jianlong Fu

Large-scale video-language pre-training has brought significant improvements to video-language understanding tasks. Previous studies of video-language pre-training mainly focus on short-form videos (i.e., within 30 seconds) and sentences, leaving long-form video-language pre-training rarely explored. Directly learning representations from long-form videos and language may benefit many long-form video-language understanding tasks. However, this is challenging because long-range relationships are difficult to model and processing more frames imposes a heavy computational burden. In this paper, we introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale long-form video and paragraph dataset constructed from an existing public dataset. To effectively capture rich temporal dynamics and to better align video and language in an efficient end-to-end manner, we introduce two novel designs in our LF-VILA model. First, we propose a Multimodal Temporal Contrastive (MTC) loss to learn the temporal relation across different modalities by encouraging fine-grained alignment between long-form videos and paragraphs. Second, we propose a Hierarchical Temporal Window Attention (HTWA) mechanism to effectively capture long-range dependencies while reducing the computational cost of the Transformer. We fine-tune the pre-trained LF-VILA model on seven downstream long-form video-language understanding tasks, covering paragraph-to-video retrieval and long-form video question-answering, and achieve new state-of-the-art performance. Specifically, our model achieves relative improvements of 16.1% on the ActivityNet paragraph-to-video retrieval task and 2.4% on the How2QA task. We release our code, dataset, and pre-trained models at https://github.com/microsoft/XPretrain.
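The abstract does not give the formula for the MTC loss, so the following is only a rough illustration of the general idea of contrastive alignment between video clips and sentences. It is a generic symmetric InfoNCE objective, not the authors' MTC formulation. The function names, the pairing of clip `i` with sentence `i`, and the temperature of 0.07 are all assumptions made for this sketch.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def log_softmax_at(row, index):
    """Log-softmax of `row`, evaluated at position `index` (numerically stable)."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return row[index] - log_sum


def symmetric_info_nce(clip_embs, sent_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss (an assumption, not the paper's MTC loss).

    Clip i and sentence i are treated as the positive pair; every other
    clip/sentence combination in the batch acts as a negative.
    """
    clips = [normalize(c) for c in clip_embs]
    sents = [normalize(s) for s in sent_embs]
    n = len(clips)
    # Cosine similarities scaled by the temperature.
    sim = [[dot(c, s) / temperature for s in sents] for c in clips]
    # Clip-to-sentence direction: each row should peak on its diagonal entry.
    c2s = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Sentence-to-clip direction: each column should peak on its diagonal entry.
    s2c = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (c2s + s2c)


# Toy embeddings (hypothetical): two clips and two sentences in 2-D.
toy_clips = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_info_nce(toy_clips, [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_info_nce(toy_clips, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the correctly ordered sentences yield a lower loss than the swapped ones. That ordering is the behavior a contrastive alignment objective is meant to reward.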

updated: Thu Mar 02 2023 09:05:43 GMT+0000 (UTC)

published: Wed Oct 12 2022 09:08:27 GMT+0000 (UTC)

arXiv
