An Empirical Study of Training End-to-End Vision-and-Language Transformers

Zi-Yi Dou; Yichong Xu; Zhe Gan; Jianfeng Wang; Shuohang Wang; Lijuan Wang; Chenguang Zhu; Pengchuan Zhang; Lu Yuan; Nanyun Peng; Zicheng Liu; Michael Zeng

Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments and provide insights on how to train a performant VL transformer. METER achieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for pre-training, surpassing the state-of-the-art region-feature-based model by 1.04%, and outperforming the previous best fully transformer-based model by 1.6%. Notably, when further scaled up, our best VQA model achieves an accuracy of 80.54%. Code and pre-trained models are released at https://github.com/zdou0830/METER.
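
The abstract names two fusion designs, merged attention and co-attention, without defining them. As a rough illustration only, the sketch below assumes the common reading of these terms: merged attention concatenates text and image tokens and runs self-attention over the joint sequence, while co-attention lets each modality attend to itself first and then cross-attend to the other. The function names (`attend`, `merged_attention`, `co_attention`) are hypothetical, and the code omits learned projections, multiple heads, residual connections, and feed-forward layers, so it is not the METER implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections.

    Assumption for illustration: keys and values are the same vectors.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, keys_values)) for i in range(dim)]
        )
    return outputs


def merged_attention(text, image):
    """Concatenate both modalities and self-attend over the joint sequence."""
    joint = text + image
    fused = attend(joint, joint)
    return fused[: len(text)], fused[len(text):]


def co_attention(text, image):
    """Self-attend within each modality, then cross-attend to the other one."""
    text_self = attend(text, text)
    image_self = attend(image, image)
    text_out = attend(text_self, image)   # text queries, image keys/values
    image_out = attend(image_self, text)  # image queries, text keys/values
    return text_out, image_out


if __name__ == "__main__":
    # Toy 2-dimensional token embeddings (made-up values).
    text_tokens = [[1.0, 0.0], [0.0, 1.0]]
    image_tokens = [[1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
    print(merged_attention(text_tokens, image_tokens))
    print(co_attention(text_tokens, image_tokens))
```

In this simplified form, the merged variant shares one attention pass across both modalities, whereas the co-attention variant keeps separate passes per modality and exchanges information only through the cross-attention step.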

updated: Fri Mar 18 2022 03:29:10 GMT+0000 (UTC)

published: Wed Nov 03 2021 17:55:36 GMT+0000 (UTC)

arXiv

