CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising

Jianjie Luo; Yehao Li; Yingwei Pan; Ting Yao; Hongyang Chao; Tao Mei

BERT-type structures have led to a revolution in vision-language pre-training and to state-of-the-art results on numerous vision-language downstream tasks. Existing solutions predominantly rely on multi-modal inputs with mask tokens to drive mask-based proxy pre-training tasks (e.g., masked language modeling and masked object/frame prediction). In this work, we argue that such masked inputs inevitably introduce noise into the cross-modal matching proxy task, and thus leave the inherent vision-language association under-explored. As an alternative, we derive a particular form of cross-modal proxy objective for video-language pre-training, i.e., Contrastive Cross-modal matching and denoising (CoCo). By viewing the masked frame/word sequences as noisy augmentations of the primary unmasked ones, CoCo strengthens video-language association by simultaneously pursuing inter-modal matching and intra-modal denoising between masked and unmasked inputs in a contrastive manner. Our CoCo proxy objective can be further integrated into any BERT-type encoder-decoder structure for video-language pre-training; we name the resulting model Contrastive Cross-modal BERT (CoCo-BERT). We pre-train CoCo-BERT on the TV dataset and on a newly collected large-scale GIF video dataset (ACTION). Through extensive experiments over a wide range of downstream tasks (e.g., cross-modal retrieval, video question answering, and video captioning), we demonstrate the superiority of CoCo-BERT as a pre-trained structure.
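To make the idea of contrastive matching and denoising concrete, the sketch below illustrates one possible reading of the abstract. It is not the paper's formulation. It uses a generic InfoNCE loss (a standard contrastive objective, swapped in here for illustration) and treats masked embeddings as queries and unmasked embeddings as keys. The function names `cosine`, `info_nce`, and `coco_style_loss`, the pairing of masked and unmasked inputs, the equal weighting of the terms, and the `temperature` value are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss: queries[i] should match keys[i], and every
    other key in the batch serves as a negative."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, key) / temperature for key in keys]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def coco_style_loss(masked_video, masked_text, video, text, temperature=0.1):
    """Hypothetical combination of the two terms named in the abstract.

    Assumption: inter-modal matching pairs a masked sequence with the
    unmasked sequence of the other modality, and intra-modal denoising
    pairs it with the unmasked sequence of the same modality. The terms
    are summed with equal weight, which is also an assumption.
    """
    inter_modal = (info_nce(masked_video, text, temperature)
                   + info_nce(masked_text, video, temperature))
    intra_modal = (info_nce(masked_video, video, temperature)
                   + info_nce(masked_text, text, temperature))
    return inter_modal + intra_modal


# Toy batch of two video-sentence pairs with 2-d embeddings (made-up values).
video = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
masked_video = [[0.9, 0.1], [0.1, 0.9]]  # "noisy" views of the unmasked inputs
masked_text = [[0.8, 0.2], [0.2, 0.8]]

aligned_loss = coco_style_loss(masked_video, masked_text, video, text)
# Swapping the sentences breaks the video-text pairing.
mismatched_loss = coco_style_loss(masked_video, masked_text, video, text[::-1])
print(aligned_loss < mismatched_loss)
```

In this toy setting the loss is lower when each masked sequence lies close to its own unmasked counterpart and to its paired sequence in the other modality, which is the behavior a contrastive matching and denoising objective is meant to reward.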

updated: Tue Dec 14 2021 16:22:44 GMT+0000 (UTC)

published: Tue Dec 14 2021 16:22:44 GMT+0000 (UTC)

arXiv
