One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks

Gregor Geigle; Chen Cecilia Liu; Jonas Pfeiffer; Iryna Gurevych

Current multimodal models, aimed at solving Vision and Language (V+L) tasks, predominantly repurpose Vision Encoders (VEs) as feature extractors. While many VEs (of different architectures, trained on different data and objectives) are publicly available, they are not designed for downstream V+L tasks. Nonetheless, most current work assumes that a single pre-trained VE can serve as a general-purpose encoder. In this work, we focus on analysis and aim to understand whether the information stored within different VEs is complementary, i.e., whether providing the model with features from multiple VEs can improve performance on a target task, and how those features are combined. We exhaustively experiment with three popular VEs on six downstream V+L tasks and analyze the attention and VE-dropout patterns. Our analyses suggest that diverse VEs complement each other, resulting in improved downstream V+L task performance, and that the improvements are not due to simple ensemble effects (i.e., performance does not always improve when the number of encoders increases). We demonstrate that future VEs, which are not repurposed but explicitly designed for V+L tasks, have the potential to improve performance on the target V+L tasks.
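The abstract does not say how features from several VEs are fed to the model or how VE-dropout is implemented. The following is only a minimal sketch of one plausible reading: per-encoder token features, assumed to be already projected to a common width, are concatenated into a single sequence, and during training each encoder's features may be dropped entirely. The function name, the encoder names, and the `drop_prob` parameter are hypothetical and do not come from the paper.

```python
import random


def combine_encoder_features(features_by_encoder, drop_prob=0.0, rng=None):
    """Concatenate per-encoder token features into one sequence.

    features_by_encoder: dict mapping a (hypothetical) encoder name to a
        list of feature vectors, all assumed to share the same width.
    drop_prob: probability of dropping an encoder's features entirely.
        This is an assumed reading of "VE-dropout", not the paper's recipe.
    Returns (kept_encoder_names, combined_feature_sequence).
    """
    rng = rng or random.Random()
    names = list(features_by_encoder)

    # Decide independently, per encoder, whether its features are kept.
    kept = [name for name in names if rng.random() >= drop_prob]

    # Always keep at least one encoder so the model still sees the image.
    if not kept:
        kept = [rng.choice(names)]

    # Concatenate along the token (sequence) axis, in the original order.
    combined = []
    for name in kept:
        combined.extend(features_by_encoder[name])
    return kept, combined


# Toy features for three made-up encoders with different token counts.
features = {
    "encoder_a": [[0.1, 0.2], [0.3, 0.4]],
    "encoder_b": [[0.5, 0.6]],
    "encoder_c": [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
}

kept_all, combined_all = combine_encoder_features(features, drop_prob=0.0)
kept_one, combined_one = combine_encoder_features(
    features, drop_prob=1.0, rng=random.Random(0)
)
```

With `drop_prob=0.0` every encoder contributes its tokens. With `drop_prob=1.0` the fallback keeps exactly one encoder, which mimics probing how well the model copes when only a single VE is available.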

updated: Thu Jun 08 2023 15:42:13 GMT+0000 (UTC)

published: Wed Oct 12 2022 16:31:39 GMT+0000 (UTC)

arXiv
