MAViL: Masked Audio-Video Learners

Po-Yao Huang; Vasu Sharma; Hu Xu; Chaitanya Ryali; Haoqi Fan; Yanghao Li; Shang-Wen Li; Gargi Ghosh; Jitendra Malik; Christoph Feichtenhofer

We present Masked Audio-Video Learners (MAViL) to train audio-visual representations. Our approach learns with three complementary forms of self-supervision: (1) reconstruction of masked audio and video input data, (2) intra- and inter-modal contrastive learning with masking, and (3) self-training by reconstructing joint audio-video contextualized features learned from the first two objectives. Pre-training with MAViL not only enables the model to perform well in audio-visual classification and retrieval tasks but also improves representations of each modality in isolation, without using information from the other modality for fine-tuning or inference. Empirically, MAViL sets a new state-of-the-art on AudioSet (53.1 mAP) and VGGSound (67.1% accuracy). For the first time, a self-supervised audio-visual model outperforms ones that use external supervision on these benchmarks.
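To make the first two objectives concrete, here is a minimal pure-Python sketch of a masked reconstruction loss and an InfoNCE-style contrastive loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 0.75 mask ratio, the 0.1 temperature, and the use of mean squared error and InfoNCE are all choices made for this example and are not given in the abstract.

```python
import math
import random


def random_mask(num_patches, mask_ratio, rng):
    """Pick the patch indices to hide. The ratio is an assumed hyperparameter."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def masked_mse(target, prediction, masked_idx):
    """Objective (1): mean squared error over the masked patches only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for t, p in zip(target[i], prediction[i]):
            total += (t - p) ** 2
            count += 1
    return total / count


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.1):
    """Objective (2): the i-th query should match the i-th key.

    All other keys in the batch act as negatives. Passing audio embeddings
    as queries and video embeddings as keys gives an inter-modal loss;
    passing two masked views of one modality gives an intra-modal loss.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    loss = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        loss += log_denominator - logits[i]
    return loss / len(q)


# Toy usage with hypothetical two-clip embeddings.
rng = random.Random(0)
masked = random_mask(num_patches=8, mask_ratio=0.75, rng=rng)
patches = [[float(i), float(i) + 0.5] for i in range(8)]
perfect_reconstruction = masked_mse(patches, patches, masked)

audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(audio, [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce(audio, [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch, objective (3) would reuse `masked_mse` with the targets swapped from raw input patches to features produced by a model trained on the first two objectives; how those features are produced is not specified in the abstract and is left out here.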

updated: Mon Jul 17 2023 05:44:35 GMT+0000 (UTC)

published: Thu Dec 15 2022 18:59:59 GMT+0000 (UTC)

arXiv
