WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

Yuqi Huo; Manli Zhang; Guangzhen Liu; Haoyu Lu; Yizhao Gao; Guoxing Yang; Jingyuan Wen; Heng Zhang; Baogui Xu; Weihao Zheng; Zongzheng Xi; Yueqian Yang; Anwen Hu; Jinming Zhao; Ruichen Li; Yida Zhao; Liang Zhang; Yuqing Song; Xin Hong; Wanqing Cui; Danyang Hou; Yingyan Li; Junyi Li; Peiyu Liu; Zheng Gong; Chuhao Jin; Yuchong Sun; Shizhe Chen; Zhiwu Lu; Zhicheng Dou; Qin Jin; Yanyan Lan; Wayne Xin Zhao; Ruihua Song; Ji-Rong Wen

Multi-modal pre-training models have been intensively explored in recent years to bridge vision and language. However, most of them explicitly model the cross-modal interaction between image-text pairs by assuming that a strong semantic correlation exists between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project 'WenLan' led by our team. Specifically, under a weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method, MoCo, to the cross-modal scenario. By building a large queue-based dictionary, BriVL can incorporate more negative samples with limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.
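The queue-based dictionary mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a standard InfoNCE loss, omits MoCo's momentum-updated key encoder, and uses hypothetical toy embeddings in place of real encoder outputs. The function names, the temperature of 0.07, and the queue size are all illustrative assumptions.

```python
import math
from collections import deque


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_with_queue(query, positive_key, negative_queue, temperature=0.07):
    """InfoNCE loss for one query against one positive key and queued negatives.

    The temperature value is an assumed default, not taken from the abstract.
    """
    q = l2_normalize(query)
    # The positive logit sits at index 0, followed by one logit per queued negative.
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(k)) / temperature for k in negative_queue]
    # Numerically stable log-sum-exp; loss = -log softmax(logits)[0].
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[0]


# Hypothetical toy setup: the queue stores text embeddings from earlier batches,
# so the number of negatives is decoupled from the current batch size.
QUEUE_SIZE = 4  # illustrative; a real queue would be far larger
text_queue = deque(maxlen=QUEUE_SIZE)  # oldest keys are dropped automatically

for old_text_emb in ([0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]):
    text_queue.append(old_text_emb)

image_emb = [1.0, 0.0, 0.0]          # stand-in for an image-tower output
matching_text_emb = [0.9, 0.1, 0.0]  # stand-in for the paired text-tower output

loss = info_nce_with_queue(image_emb, matching_text_emb, text_queue)
text_queue.append(matching_text_emb)  # enqueue the current key for later batches
```

Because negatives are read from the queue rather than from the current batch, memory for negatives grows with the queue length instead of the batch size, which is the property the abstract attributes to the queue-based dictionary. A symmetric text-to-image loss would use a second queue of image embeddings in the same way.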

updated: Fri Mar 19 2021 23:30:38 GMT+0000 (UTC)

published: Thu Mar 11 2021 09:39:49 GMT+0000 (UTC)

arXiv
