CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

Shentong Mo; Jingfei Xia; Ihor Markevych

Visual and linguistic pre-training aims to learn joint vision and language representations that can be transferred to visual-linguistic downstream tasks. However, semantic confusion between language and vision arises during the pre-training stage. Moreover, current pre-trained models tend to consume substantial computational resources for fine-tuning when transferred to downstream tasks. In this work, we present a simple but effective approach for learning Contrastive and Adaptive representations of Vision and Language, namely CAVL. Specifically, we introduce a pair-wise contrastive loss to learn alignments between the whole sentence and each image in the same batch during pre-training. At the fine-tuning stage, we introduce two lightweight adaptation networks that reduce model parameters and increase training speed, saving computational resources. We evaluate CAVL on six main downstream tasks: Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), Natural Language for Visual Reasoning (NLVR), Region-to-Phrase Grounding (RPG), Text-to-Image Retrieval (TIR), and Zero-shot Text-to-Image Retrieval (ZS-TIR). Compared to baselines, we achieve superior performance and reduce the fine-tuning time by a large margin (in particular, 76.17%). Extensive experiments and ablation studies demonstrate the efficiency of the contrastive pre-training and adaptive fine-tuning proposed in CAVL.
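To make the idea of an in-batch, pair-wise contrastive loss concrete, the following is a minimal sketch in plain Python. It is not the authors' formulation: the abstract does not give the exact loss, so this assumes a symmetric InfoNCE-style objective over cosine similarities, where the i-th sentence and i-th image in a batch form the positive pair and all other pairings are negatives. The function names and the temperature value are hypothetical.

```python
import math


def _normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]


def _mean_diagonal_nll(logits):
    # Average of -log softmax(row_i)[i]: row i should score its own pair highest.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def pairwise_contrastive_loss(sent_emb, img_emb, temperature=0.1):
    """Symmetric in-batch contrastive loss (assumed InfoNCE-style form).

    sent_emb, img_emb: equal-length lists of vectors; index i in each list
    is assumed to be a matching sentence-image pair.
    """
    s = [_normalize(v) for v in sent_emb]
    g = [_normalize(v) for v in img_emb]
    # logits[i][j]: scaled similarity between sentence i and image j.
    logits = [[sum(a * b for a, b in zip(si, gj)) / temperature for gj in g]
              for si in s]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-sentence view
    return 0.5 * (_mean_diagonal_nll(logits) + _mean_diagonal_nll(logits_t))
```

Under these assumptions the loss is small when each sentence is most similar to its own image and large when the pairing is scrambled, which is the alignment behaviour the abstract describes.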

updated: Mon Apr 10 2023 05:54:03 GMT+0000 (UTC)

published: Mon Apr 10 2023 05:54:03 GMT+0000 (UTC)

arXiv
