Masked Vision and Language Modeling for Multi-modal Representation Learning

Gukyeong Kwon; Zhaowei Cai; Avinash Ravichandran; Erhan Bas; Rahul Bhotika; Stefano Soatto


In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data: the image and the text convey almost the same information, but in different formats. Reconstructing the masked signal of one modality conditioned on the other can also implicitly learn cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method, along with common V+L alignment losses, achieves state-of-the-art performance in the regime of millions of pre-training samples. It also outperforms competing methods by a significant margin in limited-data scenarios.
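To make the joint masking idea concrete, here is a minimal NumPy sketch of how the two kinds of training inputs could be prepared: one where the image is masked and the text is intact, and one where the text is masked and the image is intact. It is an illustration under stated assumptions, not the authors' implementation. The function names, the `MASK_ID` value, zero-filling of masked patches, and the mask ratios are all hypothetical, and the reconstruction model itself is omitted.

```python
import numpy as np

MASK_ID = 0  # hypothetical id standing in for a [MASK] token


def random_mask(n, ratio, rng):
    """Return a boolean array of length n with round(n * ratio) True entries."""
    k = int(round(n * ratio))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=k, replace=False)] = True
    return mask


def make_joint_masked_pairs(patches, tokens, img_ratio=0.6, txt_ratio=0.3, seed=0):
    """Build inputs for cross-modal masked reconstruction.

    patches: float array of shape (num_patches, dim), one row per image patch
    tokens:  int array of shape (num_tokens,), token ids of the paired text
    The ratios are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    img_mask = random_mask(len(patches), img_ratio, rng)
    txt_mask = random_mask(len(tokens), txt_ratio, rng)

    masked_patches = patches.copy()
    masked_patches[img_mask] = 0.0      # blank out the masked patches

    masked_tokens = tokens.copy()
    masked_tokens[txt_mask] = MASK_ID   # replace the masked tokens

    return {
        # reconstruct masked patches, conditioned on the unmasked text
        "mim": (masked_patches, tokens, img_mask),
        # reconstruct masked tokens, conditioned on the unmasked image
        "mlm": (patches, masked_tokens, txt_mask),
    }
```

In each pair, only one modality is corrupted, so a model trained to fill in the masked positions has to draw on the other modality, which is the intuition the abstract gives for learning cross-modal alignment implicitly.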

updated: Tue Mar 14 2023 23:51:53 GMT+0000 (UTC)

published: Wed Aug 03 2022 15:11:01 GMT+0000 (UTC)

arXiv

