SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

Zirui Wang; Jiahui Yu; Adams Wei Yu; Zihang Dai; Yulia Tsvetkov; Yuan Cao

With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks. However, the requirement for expensive annotations including clean image captions and regional labels limits the scalability of existing approaches, and complicates the pretraining procedure with the introduction of multiple dataset-specific objectives. In this work, we relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM). Unlike prior work, SimVLM reduces the training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective. Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score). Furthermore, we demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.
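The abstract names a single prefix language modeling objective but does not spell it out. As a rough illustration only, the sketch below follows the common formulation of prefix LM: positions inside the prefix attend to each other bidirectionally, positions after the prefix attend causally, and the loss is the negative log-likelihood of the tokens after the prefix. The function names, the mean normalization, and the toy inputs are assumptions made for this example, not details taken from the paper.

```python
import math


def prefix_lm_attention_mask(seq_len, prefix_len):
    """Return mask[i][j] == True when position i may attend to position j.

    Hypothetical helper: every position sees the whole prefix, and
    positions after the prefix additionally see earlier non-prefix
    positions and themselves (causal).
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            row.append(j < prefix_len or j <= i)
        mask.append(row)
    return mask


def prefix_lm_loss(token_log_probs, prefix_len):
    """Mean negative log-likelihood over the tokens after the prefix.

    token_log_probs[t] is assumed to be the model's log-probability of
    the correct token at position t; prefix positions carry no loss.
    """
    targets = token_log_probs[prefix_len:]
    if not targets:
        raise ValueError("prefix_len must leave at least one target token")
    return -sum(targets) / len(targets)


# Toy usage: 4 positions, the first 2 form the prefix
# (e.g. image patches plus the start of a caption).
mask = prefix_lm_attention_mask(seq_len=4, prefix_len=2)
loss = prefix_lm_loss([math.log(0.5)] * 4, prefix_len=2)
```

In this toy case the two prefix rows of `mask` see only the prefix, while the last row sees every position. The loss averages over the two non-prefix tokens; averaging rather than summing is a normalization choice made here for readability.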

updated: Thu Mar 17 2022 06:39:59 GMT+0000 (UTC)

published: Tue Aug 24 2021 18:14:00 GMT+0000 (UTC)

arXiv
