VLMAE: Vision-Language Masked Autoencoder

Sunan He; Taian Guo; Tao Dai; Ruizhi Qiao; Chen Wu; Xiujun Shu; Bo Ren

Image and language modeling is crucial for vision-language pre-training (VLP), which aims to learn multi-modal representations from large-scale paired image-text data. However, we observe that most existing VLP methods focus on modeling the interactions between image and text features while neglecting the information disparity between image and text, and thus suffer from focal bias. To address this problem, we propose a vision-language masked autoencoder framework (VLMAE). VLMAE employs visual generative learning, which helps the model acquire fine-grained and unbiased features. Unlike previous work, VLMAE attends to almost all critical patches in an image, providing a more comprehensive understanding. Extensive experiments demonstrate that VLMAE achieves better performance on various vision-language downstream tasks, including visual question answering, image-text retrieval, and visual grounding, even with up to a 20% pre-training speedup.
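As a rough illustration of the masked-autoencoder idea named in the title, the sketch below shows generic random patch masking and a reconstruction loss computed only on the hidden patches. It is not taken from the paper: the function names `random_patch_mask` and `masked_mse`, the 196-patch grid, and the 0.75 mask ratio are all assumptions chosen for illustration, and the abstract does not say how VLMAE selects or reconstructs patches.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into (visible, masked) lists.

    Hypothetical helper: generic MAE-style random masking, not VLMAE's
    documented procedure.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def masked_mse(target, prediction, masked):
    """Mean squared error over the masked patches only.

    `target` and `prediction` are lists of patches, each patch a list of
    pixel values; visible patches do not contribute to the loss.
    """
    total, count = 0.0, 0
    for i in masked:
        for t, p in zip(target[i], prediction[i]):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


# Assumed setup: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196.
visible, masked = random_patch_mask(num_patches=196, mask_ratio=0.75, seed=0)
```

In a setup like this, the encoder would see only the `visible` patches and a decoder would be trained to predict the pixels of the `masked` ones, which is the sense in which the learning is generative.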

updated: Fri Aug 19 2022 14:39:18 GMT+0000 (UTC)

published: Fri Aug 19 2022 14:39:18 GMT+0000 (UTC)

arXiv
