Visual Spatio-Temporal Relation-Enhanced Network for Cross-Modal Text-Video Retrieval

Ning Han; Jingjing Chen; Guangyi Xiao; Yawen Zeng; Chuhao Shi; Hao Chen

The task of cross-modal retrieval between texts and videos aims to understand the correspondence between vision and language. Existing studies typically measure text-video similarity on the basis of textual and video embeddings. In common practice, a video representation is constructed either by feeding video frames into a 2D/3D-CNN for global visual feature extraction, or by learning only simple semantic relations from local-level, fine-grained frame regions via a graph convolutional network. However, these video representations do not fully exploit the spatio-temporal relations among visual components, so they cannot distinguish videos that contain the same visual components but differ in the relations among them. To solve this problem, we propose the Visual Spatio-Temporal Relation-Enhanced Network (VSR-Net), a novel cross-modal retrieval framework that considers the spatio-temporal visual relations among components to enhance the global video representation in bridging the text and video modalities. Specifically, visual spatio-temporal relations are encoded using a multi-layer spatio-temporal transformer to learn visual relational features. We align the global visual and fine-grained relational features with the text feature in two embedding spaces for cross-modal text-video retrieval. Extensive experiments are conducted on both the MSR-VTT and MSVD datasets. The results demonstrate the effectiveness of our proposed model. We will release the code to facilitate future research.
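To make the two-embedding-space idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes cosine similarity in each space and a weighted sum to fuse the two scores; the abstract specifies neither. The names `cosine`, `rank_videos`, and `weight`, and the toy embeddings, are hypothetical. The example uses two videos with identical global embeddings (same visual components) that differ only in their relational embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_videos(text_global, text_relational, videos, weight=0.5):
    """Rank videos for one text query using two embedding spaces.

    videos maps a video name to (global_embedding, relational_embedding).
    The weighted-sum fusion is an assumption made for illustration.
    """
    scores = {}
    for name, (global_emb, relational_emb) in videos.items():
        global_sim = cosine(text_global, global_emb)
        relational_sim = cosine(text_relational, relational_emb)
        scores[name] = weight * global_sim + (1 - weight) * relational_sim
    # Highest fused similarity first.
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: both videos share the same global embedding,
# but only video_a matches the query in the relational space.
videos = {
    "video_a": ([1.0, 0.0], [0.0, 1.0]),
    "video_b": ([1.0, 0.0], [1.0, 0.0]),
}
ranking = rank_videos([1.0, 0.0], [0.0, 1.0], videos)
print(ranking)
```

In this toy setup the global space alone cannot separate the two videos, so the ranking is decided entirely by the relational space, which is the failure mode the abstract says relational features are meant to address.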

updated: Sat Nov 20 2021 13:59:43 GMT+0000 (UTC)

published: Fri Oct 29 2021 08:23:40 GMT+0000 (UTC)

arXiv
