HunYuan_tvr for Text-Video Retrieval

Shaobo Min; Weijie Kong; Rong-Cheng Tu; Dihong Gong; Chengfei Cai; Wenzhe Zhao; Chenyang Liu; Sixiao Zheng; Hongfa Wang; Zhifeng Li; Wei Liu

Text-video retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while ignoring fine-grained cross-modal relationships, e.g., between short clips and phrases or between single frames and words. In this paper, we propose a novel method, named HunYuan_tvr, to explore hierarchical cross-modal interactions by simultaneously modeling video-sentence, clip-phrase, and frame-word relationships. Considering the intrinsic semantic relations between frames, HunYuan_tvr first performs self-attention to explore frame-wise correlations and adaptively clusters correlated frames into clip-level representations. Then, clip-wise correlations are explored to aggregate the clip representations into a single compact representation that describes the video globally. In this way, we construct hierarchical video representations at frame, clip, and video granularities, and likewise explore word-wise correlations to form word, phrase, and sentence embeddings for the text modality. Finally, hierarchical contrastive learning is designed to explore cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HunYuan_tvr to achieve a comprehensive multi-modal understanding. Further boosted by adaptive label denoising and marginal sample enhancement, HunYuan_tvr obtains new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 57.8%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet, respectively.
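
The idea of contrasting video and text at several granularities can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the loss, so a symmetric InfoNCE loss is assumed here, and the names `info_nce` and `hierarchical_loss`, the `temperature` value, and the equal level weights are all hypothetical. The sketch also assumes that each level has already been pooled to one embedding per sample, so that item i on the video side matches item i on the text side.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(video_embs, text_embs, temperature=0.05):
    """Symmetric InfoNCE over a batch; video i is the positive for text i.

    Assumption: this loss form and the temperature are illustrative choices,
    not taken from the paper.
    """
    v = [normalize(x) for x in video_embs]
    t = [normalize(x) for x in text_embs]
    # Cosine similarity matrix scaled by the temperature.
    sim = [[dot(a, b) / temperature for b in t] for a in v]
    n = len(sim)

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / n

    cols = [list(c) for c in zip(*sim)]  # text-to-video direction
    return 0.5 * (cross_entropy_on_diagonal(sim) + cross_entropy_on_diagonal(cols))


def hierarchical_loss(levels, weights=None):
    """Weighted sum of per-level losses.

    `levels` maps a level name (e.g. "frame-word") to a pair
    (video-side embeddings, text-side embeddings). Equal weights are a
    hypothetical default.
    """
    weights = weights or {name: 1.0 for name in levels}
    return sum(weights[name] * info_nce(v, t) for name, (v, t) in levels.items())


# Toy batch of two samples with 2-D embeddings at each granularity.
video_side = [[1.0, 0.0], [0.0, 1.0]]
text_side = [[1.0, 0.0], [0.0, 1.0]]
levels = {
    "frame-word": (video_side, text_side),
    "clip-phrase": (video_side, text_side),
    "video-sentence": (video_side, text_side),
}
total = hierarchical_loss(levels)
```

With correctly aligned pairs the per-level loss is close to zero, and swapping the text embeddings so that pairs are mismatched makes it large, which is the signal that pulls matching video and text representations together at every level.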

updated: Fri Apr 22 2022 06:55:35 GMT+0000 (UTC)

published: Thu Apr 07 2022 11:59:36 GMT+0000 (UTC)

arXiv
