MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning

Zijia Zhao; Longteng Guo; Xingjian He; Shuai Shao; Zehuan Yuan; Jing Liu

MAMO: きめ細かい視覚言語表現学習のためのマスクされたマルチモーダルモデリング

マルチモーダル表現学習は、さまざまな視覚言語タスクで有望な改善を示しています。ほとんどの既存の方法は、効果的なきめの細かい画像とテキストの相互作用を欠いている一方で、ビジョンと言語の間のグローバルレベルの調整を構築することに優れています.この論文では、細粒度マルチモーダル表現を学習するために、共同でマスクされたマルチモーダルモデリング手法を提案します。この方法では、画像とテキストの入力に対してジョイントマスキングを実行し、マスキングされた信号を回復するために暗黙的ターゲットと明示的ターゲットの両方を統合します。暗黙のターゲットは、ビジョンと言語の統一された偏りのない目標を提供し、モデルはマスクされていない入力の潜在的なマルチモーダル表現を予測します。明示的なターゲットは、高レベルで意味的に意味のある情報 (画像パッチの運動量の視覚的特徴と単語トークンの概念) を回復することにより、マルチモーダル表現をさらに充実させます。このようなマスクされたモデリングプロセスを通じて、モデルはきめの細かいマルチモーダルインタラクションを学習するだけでなく、高レベルの表現と低レベルまたは中レベルの予測ターゲット (画像ピクセルなど) の間のセマンティックギャップを回避し、意味的に豊富なマルチモーダル表現を生成します。ゼロショット設定と微調整設定の両方で優れたパフォーマンスを発揮します。当社の事前トレーニング済みモデル (MAMO という名前) は、画像テキスト検索、視覚的質問応答、視覚的推論、教師付きの弱い視覚的グラウンディングなど、さまざまなダウンストリームビジョン言語タスクで最先端のパフォーマンスを実現します。

Multimodal representation learning has shown promising improvements on various vision-language tasks. Most existing methods excel at building global-level alignment between vision and language while lacking effective fine-grained image-text interaction. In this paper, we propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations. Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover. The implicit target provides a unified and debiased objective for vision and language, where the model predicts latent multimodal representations of the unmasked input. The explicit target further enriches the multimodal representations by recovering high-level and semantically meaningful information: momentum visual features of image patches and concepts of word tokens. Through such a masked modeling process, our model not only learns fine-grained multimodal interaction, but also avoids the semantic gap between high-level representations and low- or mid-level prediction targets (e.g. image pixels), thus producing semantically rich multimodal representations that perform well on both zero-shot and fine-tuned settings. Our pre-trained model (named MAMO) achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.

updated: Wed Jun 14 2023 07:26:20 GMT+0000 (UTC)

published: Sun Oct 09 2022 06:31:15 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト