MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

Rowan Zellers; Jiasen Lu; Ximing Lu; Youngjae Yu; Yanpeng Zhao; Mohammadreza Salehi; Aditya Kusupati; Jack Hessel; Ali Farhadi; Yejin Choi

As humans, we navigate the world through all our senses, using perceptual input from each one to correct the others. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong representations about videos through all constituent modalities. When finetuned, it sets a new state-of-the-art on both VCR and TVQA, outperforming prior work by 5% and 7% respectively. Ablations show that both tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video understanding tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why incorporating audio leads to better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
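The "choose the correct masked-out snippet" step can be read as a contrastive selection problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the model emits one vector per MASK position, that each masked-out text or audio snippet is encoded into a vector of the same size, and that the other snippets in the batch act as distractors. The function names, the cosine-similarity scoring, and the temperature value are all hypothetical; the abstract specifies none of them.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def masked_snippet_loss(predicted, candidates, temperature=0.05):
    """Average cross-entropy for picking the right snippet at each MASK.

    predicted[i]  : hypothetical model output at the i-th MASK position
    candidates[i] : encoding of the snippet that was masked out at position i
    Every other candidates[j] is treated as a distractor for position i.
    The temperature is an illustrative assumption, not a value from the paper.
    """
    total = 0.0
    for i, pred in enumerate(predicted):
        logits = [cosine(pred, cand) / temperature for cand in candidates]
        # Numerically stable log-sum-exp over all candidate snippets.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(predicted)


def choose_snippet(predicted_vec, candidates):
    """Return the index of the candidate snippet most similar to the prediction."""
    scores = [cosine(predicted_vec, cand) for cand in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    snippets = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy snippet encodings
    aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]   # predictions near the truth
    print(masked_snippet_loss(aligned, snippets))
    print(choose_snippet([0.9, 0.1], snippets))
```

Under these assumptions, the loss is small when each MASK prediction sits closest to its own snippet and grows when predictions match the wrong snippet. The same scoring function could in principle rank arbitrary candidate answers at inference time, which is one way to read the abstract's claim about out-of-the-box prediction.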

updated: Fri Jan 07 2022 19:00:21 GMT+0000 (UTC)

published: Fri Jan 07 2022 19:00:21 GMT+0000 (UTC)

arXiv
