MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

Rowan Zellers; Jiasen Lu; Ximing Lu; Youngjae Yu; Yanpeng Zhao; Mohammadreza Salehi; Aditya Kusupati; Jack Hessel; Ali Farhadi; Yejin Choi

As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets a new state of the art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600, outperforming prior work by 5%, 7%, and 1.5% respectively. Ablations show that these tasks benefit from audio pretraining, including VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
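
The idea of "choosing the correct masked-out snippet" can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the loss or the scoring function, so the dot-product similarity, the softmax cross-entropy, and the names `masked_snippet_loss`, `predicted`, and `candidates` are all assumptions made for illustration.

```python
import math

def masked_snippet_loss(predicted, candidates, target_index):
    """Score each candidate snippet against the prediction at a MASK position.

    Assumption (not from the abstract): similarity is a plain dot product and
    the loss is softmax cross-entropy over the candidates.
    Returns (loss, index of the highest-scoring candidate).
    """
    # Dot-product similarity between the prediction and every candidate.
    scores = [sum(p * c for p, c in zip(predicted, cand)) for cand in candidates]
    # Numerically stable log-softmax over the candidate scores.
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    log_probs = [s - log_norm for s in scores]
    # The snippet the model would "choose" is the one with the highest score.
    chosen = max(range(len(scores)), key=scores.__getitem__)
    return -log_probs[target_index], chosen

# Toy, hand-written embeddings (hypothetical values, not real model outputs).
predicted = [1.0, 0.0]                      # model output at the MASK position
candidates = [[0.9, 0.1],                   # the true masked-out snippet
              [0.0, 1.0],                   # distractor snippet
              [-1.0, 0.0]]                  # distractor snippet
loss, chosen = masked_snippet_loss(predicted, candidates, target_index=0)
```

In this toy setup the loss is small when the prediction at the MASK position is closest to the true snippet and grows when a distractor scores higher, which is the behavior a selection-based objective needs.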

updated: Fri May 13 2022 14:25:04 GMT+0000 (UTC)

published: Fri Jan 07 2022 19:00:21 GMT+0000 (UTC)

arXiv

