Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

Nina Shvetsova; Brian Chen; Andrew Rouditchenko; Samuel Thomas; Brian Kingsbury; Rogerio Feris; David Harwath; James Glass; Hilde Kuehne

Multi-modal learning from video data has attracted increasing attention recently because it allows semantically meaningful embeddings to be trained without human annotation, enabling tasks such as zero-shot retrieval and classification. In this work, we present a multi-modal, modality-agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and to integrate them into a joint multi-modal representation, yielding an embedding that aggregates multi-modal temporal information. We propose to train the system with a combinatorial loss on everything at once, covering single modalities as well as pairs of modalities, while explicitly leaving out any add-ons such as position or modality encoding. At test time, the resulting model can process and fuse any number of input modalities. Moreover, the implicit properties of the transformer allow it to process inputs of different lengths. To evaluate the proposed approach, we train the model on the large-scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets, obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.
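The abstract does not spell out the form of the combinatorial loss, so the following is only a rough sketch of how such a loss over single modalities and pairs of modalities might be assembled. It assumes a symmetric InfoNCE-style contrastive term, a temperature of 0.05, and a hypothetical `embed(subset, batch)` callable standing in for the fusion transformer; none of these names or values come from the abstract.

```python
import math
from itertools import combinations

MODALITIES = ("text", "video", "audio")


def loss_pairings(modalities=MODALITIES):
    """List the (left, right) modality subsets to contrast (an assumed scheme)."""
    pairings = []
    # Single modality vs. single modality, e.g. ("text",) vs. ("video",).
    for a, b in combinations(modalities, 2):
        pairings.append(((a,), (b,)))
    # Single modality vs. the fused pair of the two remaining modalities.
    for m in modalities:
        rest = tuple(x for x in modalities if x != m)
        pairings.append(((m,), rest))
    return pairings


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def normalize(u):
    norm = math.sqrt(dot(u, u))
    return [x / norm for x in u]


def contrastive_loss(left, right, temperature=0.05):
    """Symmetric InfoNCE-style loss; row i of `left` matches row i of `right`."""
    left = [normalize(u) for u in left]
    right = [normalize(v) for v in right]
    n = len(left)
    sims = [[dot(left[i], right[j]) / temperature for j in range(n)]
            for i in range(n)]
    total = 0.0
    for i in range(n):
        row = sims[i]                        # left[i] against every right[j]
        col = [sims[j][i] for j in range(n)]  # right[i] against every left[j]
        total += math.log(sum(math.exp(s) for s in row)) - sims[i][i]
        total += math.log(sum(math.exp(s) for s in col)) - sims[i][i]
    return total / (2 * n)


def combinatorial_loss(embed, batch):
    """Sum the contrastive terms over all pairings.

    `embed(subset, batch)` is a hypothetical callable that returns one
    embedding per sample for the given tuple of modality names.
    """
    return sum(
        contrastive_loss(embed(left, batch), embed(right, batch))
        for left, right in loss_pairings()
    )
```

With three modalities this enumerates six terms: three that contrast one modality with another, and three that contrast one modality with the fused embedding of the other two. Because the same `embed` callable is used for every subset, the sketch mirrors the idea that one model handles any number of input modalities.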

updated: Wed Dec 08 2021 18:14:57 GMT+0000 (UTC)

published: Wed Dec 08 2021 18:14:57 GMT+0000 (UTC)

arXiv
