Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs

Emanuele Bugliarello; Ryan Cotterell; Naoaki Okazaki; Desmond Elliott

Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorized into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five V&L BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.

updated: Mon Nov 30 2020 18:55:24 GMT+0000 (UTC)

published: Mon Nov 30 2020 18:55:24 GMT+0000 (UTC)

arXiv
