InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining

Junyang Lin; An Yang; Yichang Zhang; Jie Liu; Jingren Zhou; Hongxia Yang

Multi-modal pretraining for learning high-level multi-modal representations is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, InterBERT (BERT for Interaction), the first model in M6 (MultiModality-to-MultiModality Multitask Mega-transformer), our series of multi-modal pretraining methods. The model has a strong capability for modeling the interaction between the information flows of different modalities. Its single-stream interaction module processes information from multiple modalities effectively, and the two-stream module on top preserves the independence of each modality to avoid performance degradation on single-modal tasks. We pretrain the model with three tasks: masked segment modeling (MSM), masked region modeling (MRM), and image-text matching (ITM). We then finetune it on a series of vision-and-language downstream tasks. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods. Our analysis shows that MSM and MRM are effective for pretraining and that our method achieves performance comparable to BERT on single-modal tasks. In addition, we propose a large-scale dataset for multi-modal pretraining in Chinese and develop the Chinese InterBERT, the first Chinese multi-modal pretrained model. We pretrain the Chinese InterBERT on our proposed dataset of 3.1M image-text pairs from Mobile Taobao, the largest Chinese e-commerce platform. We finetune the model for text-based image retrieval, and we recently deployed it online for topic-based recommendation.
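As a rough illustration of the masked segment modeling (MSM) task named in the abstract, the sketch below masks contiguous runs of text tokens instead of isolated tokens. The abstract does not say how segments are chosen, so reading MSM as contiguous-span masking is an assumption here, and the function name `mask_segments`, the 15% mask ratio, the maximum segment length of 3, and the `[MASK]` placeholder are all hypothetical choices for this example, not the authors' settings.

```python
import random

MASK = "[MASK]"  # placeholder token; assumed, not taken from the paper


def mask_segments(tokens, mask_ratio=0.15, max_len=3, seed=0):
    """Mask contiguous token segments until about mask_ratio of tokens are masked.

    Returns (masked_tokens, labels), where labels[i] holds the original token
    at each masked position and None everywhere else.
    """
    rng = random.Random(seed)
    n = len(tokens)
    budget = max(1, round(n * mask_ratio))  # number of tokens to mask
    masked = list(tokens)
    labels = [None] * n
    covered = 0
    attempts = 0
    # Bound the number of attempts so the loop always terminates.
    while covered < budget and attempts < 10 * n:
        attempts += 1
        # Draw a segment length, never exceeding the remaining budget.
        length = min(rng.randint(1, max_len), budget - covered)
        start = rng.randrange(0, n - length + 1)
        span = range(start, start + length)
        # Skip segments that overlap an already masked position.
        if any(labels[i] is not None for i in span):
            continue
        for i in span:
            labels[i] = tokens[i]
            masked[i] = MASK
        covered += length
    return masked, labels


if __name__ == "__main__":
    caption = "a woman in a red dress walks her dog along the beach at sunset".split()
    masked, labels = mask_segments(caption)
    print(" ".join(masked))
```

In a pretraining loop the `labels` list would supply the prediction targets at the masked positions. An analogous procedure over image regions would correspond to MRM, though the abstract gives no details on that either.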

updated: Thu Apr 22 2021 11:20:26 GMT+0000 (UTC)

published: Mon Mar 30 2020 03:13:22 GMT+0000 (UTC)

arXiv
