Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment

Mingyang Zhou; Licheng Yu; Amanpreet Singh; Mengjiao Wang; Zhou Yu; Ning Zhang

Vision-and-Language (V+L) pre-training models have achieved tremendous success in recent years on various multi-modal benchmarks. However, most existing models require pre-training on a large set of parallel image-text data, which is costly to collect compared with image-only or text-only data. In this paper, we explore unsupervised Vision-and-Language pre-training (UVLP) to learn cross-modal representations from non-parallel image and text datasets. We find two key factors that lead to good unsupervised V+L pre-training without parallel data: (i) joint image-and-text input and (ii) overall image-text alignment (even for non-parallel data). Accordingly, we propose a novel unsupervised V+L pre-training curriculum for non-parallel texts and images. We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks, including region-to-tag, region-to-phrase, and image-to-sentence alignment, to bridge the gap between the two modalities. A comprehensive ablation study shows that each granularity helps to learn a stronger pre-trained model. We adapt our pre-trained model to a set of V+L downstream tasks, including VQA, NLVR2, Visual Entailment, and RefCOCO+. Our model achieves state-of-the-art performance on all these tasks under the unsupervised setting.
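The retrieval step mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how retrieval is performed, so everything here is an assumption made for illustration: each image is represented by a list of detected object tags, sentences are scored by simple word overlap with those tags, and the function names `tokenize`, `retrieve_sentences`, and `build_weak_corpus` are hypothetical. It shows only the general idea of pairing non-parallel images and text into a weakly aligned corpus, not the authors' actual procedure.

```python
# Illustrative sketch only: tag-overlap retrieval is an assumed stand-in for
# the paper's retrieval-based approach, which the abstract does not specify.

PUNCTUATION = ".,!?;:"


def tokenize(sentence):
    """Lowercase a sentence and split it into words, dropping edge punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in sentence.split())
    return [w for w in words if w]


def retrieve_sentences(image_tags, sentences, top_k=2):
    """Return up to top_k sentences sharing the most words with the image tags.

    Sentences with no overlap are skipped; ties keep the original corpus order.
    """
    tag_set = {t.lower() for t in image_tags}
    scored = []
    for idx, sentence in enumerate(sentences):
        overlap = len(tag_set & set(tokenize(sentence)))
        if overlap > 0:
            scored.append((overlap, idx))
    # Highest overlap first, then earliest sentence index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sentences[idx] for _, idx in scored[:top_k]]


def build_weak_corpus(images, sentences, top_k=2):
    """Map each image id to its retrieved (weakly aligned) sentences."""
    return {
        image_id: retrieve_sentences(tags, sentences, top_k)
        for image_id, tags in images.items()
    }


if __name__ == "__main__":
    # Hypothetical toy data: image ids with object tags, and an unpaired text corpus.
    images = {
        "img1": ["dog", "frisbee", "grass"],
        "img2": ["pizza", "table"],
    }
    sentences = [
        "A dog catches a frisbee on the grass.",
        "A pizza sits on a wooden table.",
        "Two people walk a dog.",
        "The stock market fell today.",
    ]
    for image_id, matched in build_weak_corpus(images, sentences).items():
        print(image_id, matched)
```

In a sketch like this, the retrieved pairs are only weakly aligned: a sentence may mention an object in the image without describing the image as a whole, which is why the abstract follows retrieval with alignment tasks at several granularities.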

updated: Tue Mar 01 2022 05:34:01 GMT+0000 (UTC)

published: Tue Mar 01 2022 05:34:01 GMT+0000 (UTC)

arXiv
