An Empirical Study of Training End-to-End Vision-and-Language Transformers

Zi-Yi Dou; Yichong Xu; Zhe Gan; Jianfeng Wang; Shuohang Wang; Lijuan Wang; Chenguang Zhu; Pengchuan Zhang; Lu Yuan; Nanyun Peng; Zicheng Liu; Michael Zeng

Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIPViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments and provide insights on how to train a performant VL transformer while maintaining fast inference speed. Notably, our best model achieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for pre-training, surpassing the state-of-the-art region-feature-based model by 1.04%, and outperforming the previous best fully transformer-based model by 1.6%. Code and models are released at https://github.com/zdou0830/METER.
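The abstract contrasts two multimodal fusion designs, merged attention and co-attention. The following is a minimal sketch of the difference, not the METER implementation: it assumes single-head scaled dot-product attention with no learned projections, residual connections, layer normalization, or feed-forward layers, and the function names (`attend`, `merged_attention`, `co_attention`) are hypothetical. In this reading, merged attention concatenates text and image tokens and runs one self-attention over the joint sequence, while co-attention keeps the two streams separate and lets each one cross-attend to the other.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    # Scaled dot-product attention without learned projections
    # (a simplification made for this sketch).
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def merged_attention(text, image):
    # One self-attention over the concatenated text and image tokens.
    tokens = text + image
    fused = attend(tokens, tokens, tokens)
    return fused[:len(text)], fused[len(text):]


def co_attention(text, image):
    # Each stream self-attends, then cross-attends to the other
    # stream's input tokens (assumed ordering for this sketch).
    t_self = attend(text, text, text)
    v_self = attend(image, image, image)
    t_out = attend(t_self, image, image)
    v_out = attend(v_self, text, text)
    return t_out, v_out


if __name__ == "__main__":
    text_tokens = [[1.0, 0.0], [0.0, 1.0]]                 # 2 toy text tokens
    image_tokens = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]   # 3 toy image patches
    print(merged_attention(text_tokens, image_tokens))
    print(co_attention(text_tokens, image_tokens))
```

In this toy setting the practical difference is visible in what each output token can mix: a merged-attention output is a weighted average over all five tokens, whereas a co-attention text output is a weighted average over image tokens only.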

updated: Thu Nov 25 2021 08:17:01 GMT+0000 (UTC)

published: Wed Nov 03 2021 17:55:36 GMT+0000 (UTC)

arXiv

