EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

Junyi Chen; Longteng Guo; Jia Sun; Shuai Shao; Zehuan Yuan; Liang Lin; Dongyu Zhang

Building scalable vision-language models to learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, which is one unified multimodal Transformer pre-trained solely by one unified pre-training task. Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs to reconstruct masked signals, i.e., image pixels and text tokens, given visible signals. This simple yet effective pre-training objective accelerates training by 3.5x compared to the model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training speed. Despite its simplicity, EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
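The two mechanisms named in the abstract, modality-aware sparse MoE routing and masked signal reconstruction, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify how the router uses modality information or how the loss is computed, so the following are assumptions made only for illustration: the router input is the token plus a per-modality embedding, routing is top-1, each expert is a single linear map standing in for a feed-forward block, and the pixel loss is a mean squared error taken over masked positions only. All function and parameter names are hypothetical.

```python
import numpy as np


def modality_aware_moe(tokens, modality_ids, modality_emb, router_w, expert_ws):
    """Top-1 sparse MoE whose router also sees a per-modality embedding.

    tokens:       (n, d) features for image patches and text tokens together
    modality_ids: (n,) ints, 0 = vision, 1 = language (assumed convention)
    modality_emb: (2, d) one embedding per modality (assumed design)
    router_w:     (d, e) router weights for e experts
    expert_ws:    (e, d, d) one linear map per expert (stand-in for an FFN)
    """
    # Give the router a modality hint by adding the modality embedding.
    router_in = tokens + modality_emb[modality_ids]
    logits = router_in @ router_w
    # Numerically stable softmax over experts.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    # Each token is sent to its single highest-scoring expert.
    choice = probs.argmax(axis=1)
    out = np.zeros_like(tokens)
    for e in range(expert_ws.shape[0]):
        sel = choice == e
        if sel.any():
            # Scale the expert output by its gate probability.
            out[sel] = (tokens[sel] @ expert_ws[e]) * probs[sel, e][:, None]
    return out, choice


def masked_reconstruction_loss(target, prediction, mask):
    """Mean squared error over masked positions only (assumed loss form)."""
    mask = np.asarray(mask).astype(bool)
    diff = (prediction[mask] - target[mask]) ** 2
    return float(diff.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(6, 4))          # 3 image patches + 3 text tokens
    modality_ids = np.array([0, 0, 0, 1, 1, 1])
    out, choice = modality_aware_moe(
        tokens,
        modality_ids,
        modality_emb=rng.normal(size=(2, 4)),
        router_w=rng.normal(size=(4, 3)),
        expert_ws=rng.normal(size=(3, 4, 4)),
    )
    print("expert chosen per token:", choice)
```

In this sketch only one expert runs per token, which is what keeps the layer sparse. The text side of masked signal modeling would typically use a cross-entropy loss over the vocabulary for masked tokens rather than the squared error shown here for pixels.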

updated: Fri Mar 01 2024 11:22:54 GMT+0000 (UTC)

published: Wed Aug 23 2023 07:36:30 GMT+0000 (UTC)

arXiv
