Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Jie Jiang; Shaobo Min; Weijie Kong; Dihong Gong; Hongfa Wang; Zhifeng Li; Wei Liu

Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), to explore multi-level cross-modal relationships among video-sentence, clip-phrase, and frame-word for text-video retrieval. Considering intrinsic semantic frame relations, HCMI performs self-attention to explore frame-level correlations and adaptively clusters correlated frames into clip-level and video-level representations. In this way, HCMI constructs multi-level video representations at frame-clip-video granularities to capture fine-grained video content, and multi-level text representations at word-phrase-sentence granularities for the text modality. With multi-level representations for video and text, hierarchical contrastive learning is designed to explore fine-grained cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HCMI to achieve a comprehensive semantic comparison between video and text modalities. Further boosted by adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
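The hierarchical contrastive learning described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the loss, so the sketch assumes a symmetric InfoNCE loss at each of the three levels (frame-word, clip-phrase, video-sentence), a max-then-mean aggregation of unit-level cosine similarities, a temperature of 0.05, and equal level weights. All function names, shapes, and parameters here are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(sim, temperature=0.05):
    # sim: (B, B) text-to-video similarity matrix; matched pairs lie on the diagonal.
    # The temperature value is an assumption, not taken from the paper.
    logits = sim / temperature
    text_to_video = -np.diag(log_softmax(logits, axis=1)).mean()
    video_to_text = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (text_to_video + video_to_text)

def level_similarity(text_units, video_units):
    # text_units: (B, Nt, D), video_units: (B, Nv, D).
    # Assumed aggregation: each text unit takes its best-matching video unit,
    # then the scores are averaged over text units. With Nt = Nv = 1 this
    # reduces to plain cosine similarity between sentence and video vectors.
    t = l2_normalize(text_units)
    v = l2_normalize(video_units)
    pairwise = np.einsum("ind,jmd->ijnm", t, v)  # (B_text, B_video, Nt, Nv)
    return pairwise.max(axis=3).mean(axis=2)     # (B_text, B_video)

def hierarchical_contrastive_loss(text_levels, video_levels, weights=(1.0, 1.0, 1.0)):
    # One contrastive term per granularity; equal weights are an assumption.
    losses = [
        symmetric_info_nce(level_similarity(t, v))
        for t, v in zip(text_levels, video_levels)
    ]
    return float(sum(w * l for w, l in zip(weights, losses)))

# Usage with random stand-in features (batch of 4 pairs, 16-dim embeddings).
rng = np.random.default_rng(0)
B, D = 4, 16
text_levels = [rng.normal(size=(B, n, D)) for n in (12, 4, 1)]   # words, phrases, sentence
video_levels = [rng.normal(size=(B, n, D)) for n in (8, 3, 1)]   # frames, clips, video
loss = hierarchical_contrastive_loss(text_levels, video_levels)
```

In a real system the three levels would come from learned encoders and the clip-level and phrase-level units from the adaptive clustering the abstract mentions; the random arrays above only exercise the shapes.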

updated: Wed Dec 14 2022 03:08:34 GMT+0000 (UTC)

published: Thu Apr 07 2022 11:59:36 GMT+0000 (UTC)

arXiv
