MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model

Yatai Ji; Junjie Wang; Yuan Gong; Lin Zhang; Yanru Zhu; Hongfa Wang; Jiaxing Zhang; Tetsuya Sakai; Yujiu Yang

Multimodal semantic understanding often has to deal with uncertainty: the obtained messages tend to refer to multiple targets. Such uncertainty, both inter-modal and intra-modal, makes interpretation difficult. Little work has studied how to model this uncertainty, particularly during pre-training on unlabeled datasets and fine-tuning on task-specific downstream datasets. In this paper, we use a Probability Distribution Encoder (PDE) that exploits sequence-level interactions to project the representations of all modalities as probabilistic distributions. Compared with existing deterministic methods, this uncertainty modeling can convey richer multimodal semantic information and more complex relationships. Furthermore, we integrate uncertainty modeling with popular pre-training frameworks and propose three suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). The fine-tuned models are applied to challenging downstream tasks, including image-text retrieval, visual question answering, visual reasoning, and visual entailment, and achieve state-of-the-art results.
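To make the idea of comparing distributions rather than point embeddings concrete, here is a minimal sketch of a distribution-based contrastive objective. The abstract does not state the distribution family or the distance used, so this sketch assumes diagonal Gaussians (a mean vector and a standard-deviation vector per sample) and the squared 2-Wasserstein distance, which has a closed form for that family. The function names `wasserstein2_sq` and `contrastive_loss` and the `temperature` parameter are hypothetical and chosen for illustration; this is not the authors' implementation of the PDE or of D-VLC.

```python
import math


def wasserstein2_sq(mu_a, sigma_a, mu_b, sigma_b):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    Assumption: each representation is N(mu, diag(sigma^2)). For this
    family the distance is ||mu_a - mu_b||^2 + ||sigma_a - sigma_b||^2.
    """
    mean_term = sum((x - y) ** 2 for x, y in zip(mu_a, mu_b))
    std_term = sum((x - y) ** 2 for x, y in zip(sigma_a, sigma_b))
    return mean_term + std_term


def contrastive_loss(images, texts, temperature=1.0):
    """InfoNCE-style image-to-text loss over a batch.

    images[i] and texts[i] are a matched pair; each item is a
    (mu, sigma) tuple. Similarity is the negative squared distance,
    so closer distributions receive higher logits.
    """
    total = 0.0
    for i, (mu_i, sigma_i) in enumerate(images):
        logits = [
            -wasserstein2_sq(mu_i, sigma_i, mu_t, sigma_t) / temperature
            for mu_t, sigma_t in texts
        ]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_norm - logits[i]
    return total / len(images)


if __name__ == "__main__":
    images = [([0.0, 0.0], [1.0, 1.0]), ([5.0, 5.0], [0.5, 0.5])]
    texts = [([0.1, 0.0], [1.0, 1.0]), ([5.0, 4.9], [0.5, 0.5])]
    print("matched pairs:   ", contrastive_loss(images, texts))
    print("mismatched pairs:", contrastive_loss(images, texts[::-1]))
```

Under these assumptions, the standard-deviation term lets two samples with the same mean still differ when one is more ambiguous than the other, which a deterministic point embedding cannot express. The loss for matched pairs should come out lower than for mismatched pairs.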

updated: Sun Mar 26 2023 04:54:25 GMT+0000 (UTC)

published: Tue Oct 11 2022 10:54:54 GMT+0000 (UTC)

arXiv
