DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention

Fenglin Liu; Xian Wu; Shen Ge; Xuancheng Ren; Wei Fan; Xu Sun; Yuexian Zou

Vision-and-language (V-L) tasks require a system to understand both visual content and natural language, so learning fine-grained joint representations of vision and language (a.k.a. V-L representations) is of paramount importance. Recently, various pre-trained V-L models have been proposed to learn V-L representations, and they achieve improved results on many tasks. However, the mainstream models process both vision and language inputs with the same set of attention matrices. As a result, the generated V-L representations are entangled in one common latent space. To tackle this problem, we propose DiMBERT (short for Disentangled Multimodal-Attention BERT), a novel framework that applies separate attention spaces to vision and language, so that the representations of the two modalities can be disentangled explicitly. To enhance the correlation between vision and language in the disentangled spaces, we introduce visual concepts to DiMBERT, which represent visual information in textual format. In this manner, visual concepts help to bridge the gap between the two modalities. We pre-train DiMBERT on a large number of image-sentence pairs with two tasks: bidirectional language modeling and sequence-to-sequence language modeling. After pre-training, DiMBERT is further fine-tuned for the downstream tasks. Experiments show that DiMBERT sets new state-of-the-art performance on three tasks (over four datasets), including both generation tasks (image captioning and visual storytelling) and classification tasks (referring expressions). The proposed DiM (short for Disentangled Multimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, with up to a 5% increase on the representative task. Finally, we conduct a systematic analysis and demonstrate the effectiveness of our DiM and the introduced visual concepts.
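The contrast between one shared set of attention matrices and separate attention spaces per modality can be illustrated with a small sketch. This is not the DiM module itself: the abstract does not give its formulation, so the code below assumes the simplest reading, in which each token is projected to queries, keys, and values with the matrices of its own modality before ordinary scaled dot-product attention is applied over the whole sequence. All names (`make_params`, `attention`, the `"vision"` and `"language"` labels) and the toy dimensions are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_params(rng, d):
    """One set of query/key/value projection matrices (hypothetical helper)."""
    def rand_matrix():
        return [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]
    return {"q": rand_matrix(), "k": rand_matrix(), "v": rand_matrix()}


def attention(tokens, modalities, params_by_modality):
    """Single-head scaled dot-product attention over a mixed sequence.

    Assumption: each token is projected with the parameter set registered
    for its own modality. Passing the same set for both modalities gives
    the usual shared (entangled) attention.
    """
    d = len(tokens[0])
    qs, ks, vs = [], [], []
    for x, m in zip(tokens, modalities):
        p = params_by_modality[m]
        qs.append(matvec(p["q"], x))
        ks.append(matvec(p["k"], x))
        vs.append(matvec(p["v"], x))
    outputs = []
    for q in qs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in ks]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, vs)) for j in range(d)]
        )
    return outputs


rng = random.Random(0)
d = 4
# Toy sequence: three image-region features followed by two word embeddings.
tokens = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(5)]
modalities = ["vision", "vision", "vision", "language", "language"]

# Shared attention: both modalities use the same matrices.
shared = make_params(rng, d)
entangled = attention(tokens, modalities,
                      {"vision": shared, "language": shared})

# Separate attention spaces: one parameter set per modality.
disentangled = attention(tokens, modalities,
                         {"vision": make_params(rng, d),
                          "language": make_params(rng, d)})
```

In this sketch the only difference between the two calls is the parameter lookup, which is consistent with the claim that such a module can be added to an existing pre-trained V-L model, although how DiMBERT actually does this is not specified in the abstract.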

updated: Fri Oct 28 2022 23:00:40 GMT+0000 (UTC)

published: Fri Oct 28 2022 23:00:40 GMT+0000 (UTC)

arXiv
