COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Simon Ging; Mohammadreza Zolfaghari; Hamed Pirsiavash; Thomas Brox

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative Hierarchical Transformer (COOT) to leverage this hierarchical information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer, which learns the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss, which connects video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext
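The abstract names a cross-modal cycle-consistency loss but does not spell out its form. The following is a minimal sketch of one plausible formulation, assuming a soft nearest-neighbor cycle: each sentence embedding is mapped to a softly weighted clip embedding, that result is mapped back to the sentences, and the loss penalizes the squared distance between the starting index and the softly located return index. The function names and the use of squared Euclidean distance are assumptions made for illustration, not the released implementation.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_nearest_neighbor(query, candidates):
    """Weighted average of candidates; closer candidates get larger weights."""
    weights = softmax([-sq_dist(query, c) for c in candidates])
    dim = len(query)
    return [sum(w * c[d] for w, c in zip(weights, candidates)) for d in range(dim)]


def cycle_consistency_loss(clips, sentences):
    """Hypothetical sentence -> clip -> sentence cycle loss (assumed form)."""
    total = 0.0
    for i, sent in enumerate(sentences):
        # Step 1: soft nearest neighbor of the sentence among clip embeddings.
        clip_nn = soft_nearest_neighbor(sent, clips)
        # Step 2: cycle back and softly locate clip_nn among the sentences.
        back = softmax([-sq_dist(clip_nn, s) for s in sentences])
        soft_index = sum(w * k for k, w in enumerate(back))
        # Step 3: penalize landing away from the starting position.
        total += (i - soft_index) ** 2
    return total / len(sentences)
```

Under these assumptions, well-separated and correctly aligned clip and sentence embeddings give a loss close to zero, while sentence embeddings that cannot be told apart send every cycle to the middle of the sequence and yield a positive loss.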

updated: Sun Nov 01 2020 18:54:09 GMT+0000 (UTC)

published: Sun Nov 01 2020 18:54:09 GMT+0000 (UTC)

arXiv
