Cross-Modal Contrastive Learning for Robust Reasoning in VQA

Qi Zheng; Chaoyue Wang; Daqing Liu; Dadong Wang; Dacheng Tao

Multi-modal reasoning in visual question answering (VQA) has witnessed rapid progress recently. However, most reasoning models rely heavily on shortcuts learned from training data, which prevents their use in challenging real-world scenarios. In this paper, we propose a simple but effective cross-modal contrastive learning strategy that removes the shortcut reasoning caused by imbalanced annotations and improves overall performance. Unlike existing contrastive learning methods, which use complex negative categories at the coarse (Image, Question, Answer) triplet level, we leverage the correspondences between the language and image modalities to perform finer-grained cross-modal contrastive learning. We treat each Question-Answer (QA) pair as a whole and differentiate between images that conform to it and those that do not. To alleviate sampling bias, we further build connected graphs among images. For each positive pair, we regard images from different graphs as negative samples and derive a multi-positive version of contrastive learning. To the best of our knowledge, this is the first paper to show that a general contrastive learning strategy without delicate hand-crafted rules can contribute to robust VQA reasoning. Experiments on several mainstream VQA datasets demonstrate our superiority over the state of the art. Code is available at https://github.com/qizhust/cmcl_vqa_pl.
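The graph-based negative sampling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes a supervised-contrastive-style loss in which every image in the same connected graph as the annotated image counts as a positive, and images from other graphs act as negatives. The function names, the `edges` input, and the `temperature` value are all hypothetical; the abstract does not say how the graphs are built.

```python
import numpy as np

def connected_components(num_images, edges):
    """Label each image with the id of its connected graph (union-find).

    `edges` is a hypothetical list of (i, j) image links; how such links
    are obtained is not specified in the abstract.
    """
    parent = list(range(num_images))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in edges:
        parent[find(a)] = find(b)
    return np.array([find(i) for i in range(num_images)])

def multi_positive_contrastive_loss(qa_emb, img_emb, qa_to_img, component,
                                    temperature=0.1):
    """Assumed multi-positive contrastive loss over QA-image similarities.

    qa_emb:    (N, d) embeddings of QA pairs
    img_emb:   (M, d) embeddings of images
    qa_to_img: (N,)   index of the image each QA pair is annotated on
    component: (M,)   connected-graph id of each image
    """
    # Cosine similarity between every QA pair and every image.
    qa = qa_emb / np.linalg.norm(qa_emb, axis=1, keepdims=True)
    im = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = qa @ im.T / temperature                      # (N, M)

    # Positives: images in the same graph as the annotated image.
    # Everything else (images from different graphs) is a negative.
    pos_mask = component[qa_to_img][:, None] == component[None, :]

    # Numerically stable log-softmax over all images.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average the log-likelihood over each QA pair's positives.
    loss_per_qa = -(log_prob * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)
    return loss_per_qa.mean()
```

In this sketch, images linked into one graph are never used as negatives for each other's QA pairs, which is one way to read the abstract's remedy for sampling bias.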

updated: Mon Nov 21 2022 05:32:24 GMT+0000 (UTC)

published: Mon Nov 21 2022 05:32:24 GMT+0000 (UTC)

arXiv
