HANet: Hierarchical Alignment Networks for Video-Text Retrieval

Peng Wu; Xiangteng He; Mingqian Tang; Yiliang Lv; Jing Liu

Video-text retrieval is an important yet challenging task in vision-language understanding. It aims to learn a joint embedding space in which related video and text instances lie close to each other. Most current methods measure video-text similarity using only video-level and text-level embeddings, and neglecting finer-grained, local information leaves the representations insufficient. Some methods exploit local details by disentangling sentences but overlook the corresponding videos, which makes the video and text representations asymmetric. To address these limitations, we propose a Hierarchical Alignment Network (HANet) that aligns representations at different levels for video-text matching. Specifically, we first decompose video and text into three semantic levels: event (video and text), action (motion and verb), and entity (appearance and noun). Based on these, we construct hierarchical representations in an individual-local-global manner: the individual level focuses on the alignment between frames and words, the local level on the alignment between video clips and textual contexts, and the global level on the alignment between the whole video and the whole text. Alignments at the different levels capture fine-to-coarse correlations between video and text and take advantage of the complementary information among the three semantic levels. In addition, HANet is richly interpretable because it explicitly learns key semantic concepts. Extensive experiments on two public datasets, MSR-VTT and VATEX, show that the proposed HANet outperforms other state-of-the-art methods, which demonstrates the effectiveness of hierarchical representation and alignment. Our code is publicly available.
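The individual-local-global idea in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the abstract does not say how each alignment is scored or how the levels are fused, so the cosine similarity, the max-then-mean matching, the fixed-size pooling windows, the equal-weight average, and every function name below are assumptions made purely for illustration. Inputs are taken to be per-frame and per-word vectors that already live in a shared embedding space.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pool(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def chunk_pool(vectors, size):
    """Mean-pool consecutive, non-overlapping windows of `size` vectors."""
    return [mean_pool(vectors[i:i + size]) for i in range(0, len(vectors), size)]


def max_mean_alignment(video_items, text_items):
    """For each text item take its best-matching video item, then average."""
    best_scores = [max(cosine(v, t) for v in video_items) for t in text_items]
    return sum(best_scores) / len(best_scores)


def hierarchical_similarity(frames, words, window=2):
    """Equal-weight average of three assumed alignment scores.

    individual: frames vs. words
    local:      pooled clips vs. pooled textual contexts (hypothetical windows)
    global:     whole video vs. whole text
    """
    individual = max_mean_alignment(frames, words)
    local = max_mean_alignment(chunk_pool(frames, window), chunk_pool(words, window))
    global_score = cosine(mean_pool(frames), mean_pool(words))
    return (individual + local + global_score) / 3.0


def rank_videos(words, videos, window=2):
    """Return video indices sorted from most to least similar to the text query."""
    scores = [hierarchical_similarity(frames, words, window) for frames in videos]
    return sorted(range(len(videos)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy 2-D embeddings; real features would come from trained encoders.
    query_words = [[1.0, 0.0], [0.0, 1.0]]
    candidate_videos = [
        [[-1.0, 0.0], [0.0, -1.0]],
        [[1.0, 0.0], [0.0, 1.0]],
    ]
    print(rank_videos(query_words, candidate_videos))
```

Retrieval then amounts to sorting candidates by the fused score, as `rank_videos` does. A learned model would replace the fixed pooling and the equal weights with trained components; the sketch only shows how fine-grained and coarse-grained similarities can be combined into a single ranking signal.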

updated: Mon Jul 26 2021 09:28:50 GMT+0000 (UTC)

published: Mon Jul 26 2021 09:28:50 GMT+0000 (UTC)

arXiv
