Learning Trajectory-Word Alignments for Video-Language Tasks

Xu Yang; Zhangzikang Li; Haiyang Xu; Hanwang Zhang; Qinghao Ye; Chenliang Li; Ming Yan; Yu Zhang; Fei Huang; Songfang Huang

Aligning objects with words plays a critical role in Image-Language BERT (IL-BERT) and Video-Language BERT (VDL-BERT). Unlike in an image, where an object covers a few spatial patches, an object in a video usually appears as an object trajectory: it spans only a few spatial patches but a longer run of temporal patches, and thus contains abundant spatiotemporal context. However, modern VDL-BERTs neglect this trajectory characteristic: they usually follow IL-BERTs in deploying patch-to-word (P2W) attention, which may over-exploit trivial spatial context and neglect significant temporal context. To amend this, we propose TW-BERT, a novel model that learns Trajectory-Word alignment for solving video-language tasks. This alignment is learned by a newly designed trajectory-to-word (T2W) attention. Besides T2W attention, we follow previous VDL-BERTs and place a word-to-patch (W2P) attention in the cross-modal encoder. Since the T2W and W2P attentions have different structures, our cross-modal encoder is asymmetric. To further help this asymmetric cross-modal encoder build robust vision-language associations, we propose a fine-grained "align-before-fuse" strategy that pulls the embedding spaces computed by the video and text encoders closer together. With the proposed strategy and T2W attention, TW-BERT achieves state-of-the-art (SOTA) performance on text-to-video retrieval tasks, and performance on video question answering tasks comparable to that of some VDL-BERTs trained on much more data. The code will be available in the supplementary material.
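
The contrast between P2W and T2W attention described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how trajectories are built, so the sketch assumes, purely for illustration, that a trajectory is the same patch index followed across all frames and mean-pooled. The helper names (`cross_attention`, `pool_trajectories`) and the toy features are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each query attends over all keys."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def pool_trajectories(video):
    """video[t][p] is the feature of patch p in frame t.

    ASSUMPTION (not from the paper): a trajectory is the same patch index
    followed across all frames, summarized by mean pooling over time.
    """
    num_frames = len(video)
    num_patches = len(video[0])
    dim = len(video[0][0])
    return [
        [sum(video[t][p][d] for t in range(num_frames)) / num_frames
         for d in range(dim)]
        for p in range(num_patches)
    ]


# Toy input: 3 frames x 2 patches x 2-dim features, and 2 word embeddings.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.6, 0.4], [0.2, 0.8]],
]
words = [[1.0, 0.0], [0.0, 1.0]]

# P2W: every patch of every frame is its own query (frames * patches queries).
patch_queries = [patch for frame in video for patch in frame]
p2w = cross_attention(patch_queries, words, words)

# T2W (under the assumption above): one query per trajectory.
trajectory_queries = pool_trajectories(video)
t2w = cross_attention(trajectory_queries, words, words)

print(len(p2w), len(t2w))
```

In this sketch, P2W attention treats each patch in isolation, whereas the trajectory queries carry information aggregated along the temporal axis before they attend to the words. A real model would use learned projections and a more careful trajectory construction.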

updated: Fri Jan 06 2023 10:06:44 GMT+0000 (UTC)

published: Thu Jan 05 2023 08:21:01 GMT+0000 (UTC)

arXiv
