Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training

Hongwei Xue; Yupan Huang; Bei Liu; Houwen Peng; Jianlong Fu; Houqiang Li; Jiebo Luo

Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs, which are then fine-tuned for downstream vision-language tasks. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN and then aligns images and text with a Transformer. Visual relationships among visual contents play an important role in image understanding and are the basis for inter-modal alignment learning. However, CNNs are limited in visual relation learning because their local receptive fields are weak at modeling long-range dependencies. As a result, the two objectives of learning visual relations and learning inter-modal alignment are both encapsulated in the same Transformer network. Such a design might restrict inter-modal alignment learning in the Transformer by ignoring the specialized characteristics of each objective. To tackle this, we propose a fully Transformer visual embedding for VLP to better learn visual relations and further promote inter-modal alignment. Specifically, we propose a metric named Inter-Modality Flow (IMF) to measure the interaction between the vision and language modalities (i.e., inter-modality). We also design a novel masking optimization mechanism named Masked Feature Regression (MFR) in the Transformer to further promote inter-modality learning. To the best of our knowledge, this is the first study to explore the benefit of the Transformer for visual feature learning in VLP. We verify our method on a wide range of vision-language tasks, including Image-Text Retrieval, Visual Question Answering (VQA), Visual Entailment and Visual Reasoning. Our approach not only outperforms state-of-the-art VLP methods, but also shows benefits on the IMF metric.
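The abstract names Masked Feature Regression but does not spell out its formulation. As a rough illustration only, the sketch below shows a generic masked-feature-regression objective: some visual feature vectors are hidden, and a squared-error loss is computed on the hidden positions. The function names `mask_features` and `regression_loss`, the zero-vector masking, the uniform random selection, and the toy features are all assumptions made for this example; they are not taken from the paper, whose actual masking strategy and loss may differ.

```python
import random


def mask_features(features, mask_ratio, rng):
    """Zero out a random subset of feature vectors (assumed masking scheme).

    Returns a masked copy of the features and the sorted masked indices.
    The input list is left unchanged.
    """
    n = len(features)
    k = max(1, round(n * mask_ratio))
    masked_idx = sorted(rng.sample(range(n), k))
    masked = [list(vec) for vec in features]
    for i in masked_idx:
        masked[i] = [0.0] * len(features[i])
    return masked, masked_idx


def regression_loss(predicted, target, masked_idx):
    """Mean squared error computed only over the masked positions."""
    total = 0.0
    count = 0
    for i in masked_idx:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    # Toy "visual features": four tokens, two dimensions each (hypothetical).
    features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    masked, idx = mask_features(features, mask_ratio=0.5, rng=random.Random(0))
    # A real model would predict the hidden features from context; here the
    # masked input stands in for a model that has learned nothing yet.
    print("masked positions:", idx)
    print("loss of untrained stand-in:", regression_loss(masked, features, idx))
```

In an actual pre-training setup, the prediction would come from the Transformer's output at the masked positions, and the loss would be minimized jointly with the other pre-training objectives.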

updated: Mon Jun 28 2021 04:42:48 GMT+0000 (UTC)

published: Fri Jun 25 2021 08:04:25 GMT+0000 (UTC)

arXiv
