Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training

Dezhao Luo; Jiabo Huang; Shaogang Gong; Hailin Jin; Yang Liu

The correlation between vision and text is essential for video moment retrieval (VMR). However, existing methods rely heavily on separately pre-trained feature extractors for visual and textual understanding. Without sufficient temporal boundary annotations, it is non-trivial to learn universal video-text alignments. In this work, we explore multi-modal correlations derived from large-scale image-text data to facilitate generalisable VMR. To address the limitations of image-text pre-training models in capturing video changes, we propose a generic method, referred to as Visual-Dynamic Injection (VDI), to strengthen the model's understanding of video moments. Whilst existing VMR methods focus on building temporal-aware video features, awareness of text descriptions of temporal changes is also critical, yet it is overlooked in pre-training that matches static images with sentences. Therefore, we extract visual context and spatial dynamic information from video frames and explicitly enforce their alignment with the phrases describing video changes (e.g. verbs). By doing so, the potentially relevant visual and motion patterns in videos are encoded (injected) into the corresponding text embeddings, so as to enable more accurate video-text alignments. We conduct extensive experiments on two VMR benchmark datasets (Charades-STA and ActivityNet-Captions) and achieve state-of-the-art performance. In particular, VDI yields notable advantages when tested on out-of-distribution splits where the testing samples involve novel scenes and vocabulary.
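The idea of aligning change-describing phrases with visual context and dynamics can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the helper names (`mean_pool`, `dynamics`, `alignment_loss`), the use of averaged frame differences as a motion cue, the InfoNCE-style contrastive loss, and the hand-written 2-D "embeddings" are all hypothetical stand-ins for what the abstract describes only in words.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors (a crude 'visual context')."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dynamics(frames):
    """Average difference between consecutive frame features (a crude motion cue)."""
    diffs = [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(frames, frames[1:])]
    return mean_pool(diffs)


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def alignment_loss(text_emb, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed objective): low when text_emb is closer
    to `positive` than to any vector in `negatives`."""
    logits = [cosine(text_emb, positive) / temperature]
    logits += [cosine(text_emb, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


# Toy per-frame features for two moments (hypothetical 2-D values).
moment_a = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]  # changes along the first axis
moment_b = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]]  # changes along the second axis

context_a = mean_pool(moment_a)  # visual context of moment A
dyn_a = dynamics(moment_a)       # dynamic cue of moment A
dyn_b = dynamics(moment_b)       # dynamic cue of moment B

# Hypothetical embedding of a verb phrase that describes moment A's change.
verb_emb = [1.0, 0.0]

loss_matched = alignment_loss(verb_emb, dyn_a, [dyn_b])
loss_mismatched = alignment_loss(verb_emb, dyn_b, [dyn_a])
```

In this toy setting `loss_matched` is smaller than `loss_mismatched`, so minimising such a loss would pull a verb-phrase embedding towards the dynamics of the moment it describes. A real system would use learned frame and text encoders in place of the hand-written vectors.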

updated: Tue Feb 28 2023 19:29:05 GMT+0000 (UTC)

published: Tue Feb 28 2023 19:29:05 GMT+0000 (UTC)

arXiv
