Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

Emanuele Bugliarello; Ryan Cotterell; Naoaki Okazaki; Desmond Elliott

Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorised into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five V&L BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.

updated: Sun May 30 2021 23:37:58 GMT+0000 (UTC)

published: Mon Nov 30 2020 18:55:24 GMT+0000 (UTC)

arXiv
