VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Hassan Akbari; Liangzhe Yuan; Rui Qian; Wei-Hong Chuang; Shih-Fu Chang; Yin Cui; Boqing Gong

We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate it on the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic, single-backbone Transformer by sharing weights among the three modalities. We show that the convolution-free VATT outperforms state-of-the-art ConvNet-based architectures on these downstream tasks. In particular, VATT's vision Transformer achieves top-1 accuracy of 82.1% on Kinetics-400, 83.6% on Kinetics-600, and 41.1% on Moments in Time, setting new records without supervised pre-training. Transferring to image classification yields 78.7% top-1 accuracy on ImageNet, compared to 64.7% when the same Transformer is trained from scratch, which shows the generalizability of our model despite the domain gap between videos and images. VATT's audio Transformer also sets a new record on waveform-based audio event recognition, achieving an mAP of 39.4% on AudioSet without any supervised pre-training. VATT's source code is publicly available.
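The abstract names multimodal contrastive losses without giving their form. As a rough illustration only, the sketch below assumes an InfoNCE-style loss over a batch of paired embeddings, where row i of one modality should match row i of the other and all other rows act as negatives. The function names, the temperature value, and the toy embeddings are hypothetical and are not taken from the VATT source code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss over a batch of paired embeddings.

    anchors[i] is expected to match positives[i]; every other row of
    `positives` serves as a negative for anchors[i]. The temperature
    default is an illustrative assumption, not a value from the paper.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i with every candidate, scaled.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the true pair.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy 2-D "video" and "audio" embeddings (made-up values).
video = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[0.9, 0.1], [0.1, 0.9]]
audio_swapped = [audio_aligned[1], audio_aligned[0]]

loss_aligned = info_nce(video, audio_aligned)
loss_swapped = info_nce(video, audio_swapped)
```

With these toy inputs the correctly paired batch gives a lower loss than the swapped one, which is the property a contrastive objective relies on: embeddings of matching video and audio clips are pulled together while mismatched pairs are pushed apart.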

updated: Fri Aug 13 2021 03:05:34 GMT+0000 (UTC)

published: Thu Apr 22 2021 17:07:41 GMT+0000 (UTC)

arXiv
