Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks

Kousik Rajesh; Mrigank Raman; Mohammed Asad Karim; Pranit Chawla

Recently there has been a surge of multi-modal architectures based on Large Language Models (LLMs). These architectures leverage the zero-shot generation capabilities of LLMs by projecting image embeddings into the text space and then using the LLM's auto-regressive capacity to solve tasks such as VQA, captioning, and image retrieval. We call them "bridge architectures" because they project from the image space to the text space. These models deviate from the traditional recipe for training transformer-based multi-modal models, which involves large-scale pre-training and complex multi-modal interactions through co-attention or cross-attention. However, the capabilities of bridge architectures have not been tested on complex visual reasoning tasks that require fine-grained analysis of the image. In this project, we investigate the performance of these bridge architectures on the NLVR2 dataset and compare it to that of state-of-the-art transformer-based architectures. We first extend the traditional bridge architectures for the NLVR2 dataset by adding object-level features to facilitate fine-grained object reasoning. Our analysis shows that adding object-level features to bridge architectures does not help, and that pre-training on multi-modal data is key for good performance on complex reasoning tasks such as NLVR2. We also present some initial results for a recently developed bridge architecture, LLaVA, in the zero-shot setting and analyze its performance.
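The core idea of a bridge architecture, as the abstract describes it, is to project image embeddings into the text embedding space and hand the result to a language model. The sketch below illustrates only that projection step. It is not the authors' implementation: the function names, the dimensions, the random weights, and the choice of a single linear projection are all assumptions made for illustration, and the language model itself is omitted.

```python
import random

def make_projection(image_dim, text_dim, seed=0):
    """Return an (image_dim x text_dim) weight matrix.

    Assumption: random values stand in for weights that a real
    bridge architecture would learn during training.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(text_dim)]
            for _ in range(image_dim)]

def project(features, weights):
    """Linearly map each image feature vector into the text embedding space."""
    text_dim = len(weights[0])
    return [
        [sum(f[i] * weights[i][j] for i in range(len(f)))
         for j in range(text_dim)]
        for f in features
    ]

def build_llm_input(image_features, text_embeddings, weights):
    """Prefix the projected image vectors to the text token embeddings."""
    return project(image_features, weights) + text_embeddings

# Hypothetical sizes, chosen only to keep the example small.
IMAGE_DIM, TEXT_DIM = 8, 4
weights = make_projection(IMAGE_DIM, TEXT_DIM)
image_features = [[1.0] * IMAGE_DIM for _ in range(3)]   # 3 image feature vectors
text_embeddings = [[0.0] * TEXT_DIM for _ in range(5)]   # 5 text token embeddings

sequence = build_llm_input(image_features, text_embeddings, weights)
print(len(sequence), len(sequence[0]))  # prints "8 4"
```

The resulting sequence has one row per image vector plus one per text token, all in the text embedding dimension, which is the form an auto-regressive language model would consume.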

updated: Mon Jul 31 2023 03:57:31 GMT+0000 (UTC)

published: Mon Jul 31 2023 03:57:31 GMT+0000 (UTC)

arXiv
