BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval

Ning Han; Jingjing Chen; Chuhao Shi; Yawen Zeng; Guangyi Xiao; Hao Chen

Text-video retrieval, which aims to understand the correspondence between language and vision, has gained increasing attention in recent years. Previous studies either adopt off-the-shelf 2D/3D-CNNs and then use average/max pooling to directly capture spatial features with aggregated temporal information as global video embeddings, or introduce graph-based models and expert knowledge to learn local spatio-temporal relations. However, the existing methods have two limitations: 1) The global video representations learn video temporal information through simple average/max pooling and do not fully explore the temporal information between every pair of frames. 2) The graph-based local video representations are handcrafted and depend heavily on expert knowledge and empirical feedback, so they may not effectively mine higher-level, fine-grained visual relations. As a result, these methods cannot distinguish videos that contain the same visual components but differ in the relations among them. To solve this problem, we propose a novel cross-modal retrieval framework, Bi-Branch Complementary Network (BiC-Net), which modifies the transformer architecture to bridge the text and video modalities in a complementary manner by combining local spatio-temporal relations and global temporal information. Specifically, local video representations are encoded using multiple transformer blocks and additional residual blocks to learn spatio-temporal relation features; we call this module the Spatio-Temporal Residual Transformer (SRT). Meanwhile, global video representations are encoded using a multi-layer transformer block to learn global temporal features. Finally, we align the spatio-temporal relation features and the global temporal features with the text features in two embedding spaces for cross-modal text-video retrieval.
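The first limitation can be made concrete with a minimal Python sketch. It is not taken from the paper: the toy frame features are made-up numbers, and the `adjacent_differences` helper is a hypothetical stand-in for any order-aware feature (BiC-Net itself uses transformer blocks, not this difference feature). It only shows that average pooling discards frame order, so two clips with the same frames in opposite order receive the same global embedding.

```python
# Illustrative sketch only; not the BiC-Net implementation.

def mean_pool(frames):
    """Average per-frame feature vectors into a single video-level vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def adjacent_differences(frames):
    """Hypothetical order-aware feature: differences between consecutive frames."""
    return [
        [later - earlier for earlier, later in zip(prev_frame, next_frame)]
        for prev_frame, next_frame in zip(frames, frames[1:])
    ]


# Toy 2-D frame features (assumed values, for illustration only).
clip_forward = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
# Same visual content, opposite temporal order.
clip_backward = list(reversed(clip_forward))

# Average pooling is invariant to frame order, so the two clips collide.
pooled_forward = mean_pool(clip_forward)
pooled_backward = mean_pool(clip_backward)
same_after_pooling = pooled_forward == pooled_backward

# A feature built from frame-to-frame relations tells them apart.
differs_with_relations = (
    adjacent_differences(clip_forward) != adjacent_differences(clip_backward)
)

print("identical after mean pooling:", same_after_pooling)
print("distinguishable with pairwise differences:", differs_with_relations)
```

Any representation that depends on how frames relate to one another, rather than only on which frames occur, avoids this collision; the abstract's local spatio-temporal branch targets that kind of relational information.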

updated: Wed Jun 01 2022 11:49:41 GMT+0000 (UTC)

published: Fri Oct 29 2021 08:23:40 GMT+0000 (UTC)

arXiv
