From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models

Jiaxian Guo; Junnan Li; Dongxu Li; Anthony Meng Huat Tiong; Boyang Li; Dacheng Tao; Steven C. H. Hoi

Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, using LLMs effectively for zero-shot visual question answering (VQA) remains challenging, primarily because of the modality disconnection and the task disconnection between LLMs and the VQA task. End-to-end training on vision and language data may bridge these disconnections, but it is inflexible and computationally expensive. To address this issue, we propose Img2Prompt, a plug-and-play module that provides prompts that bridge the aforementioned modality and task disconnections, so that LLMs can perform zero-shot VQA tasks without end-to-end training. To produce such prompts, we employ LLM-agnostic models that describe the image content and generate self-constructed question-answer pairs, which effectively guide the LLM to perform zero-shot VQA tasks. Img2Prompt offers the following benefits: 1) It can flexibly work with various LLMs to perform VQA. 2) Because it needs no end-to-end training, it significantly reduces the cost of deploying LLMs for zero-shot VQA tasks. 3) It achieves comparable or better performance than methods relying on end-to-end training. For example, we outperform Flamingo [Deepmind:Flamingo2022] by 5.6% on VQAv2. On the challenging A-OKVQA dataset, our method even outperforms few-shot methods by as much as 20%.
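As a rough illustration of the idea in the abstract, the sketch below assembles a text-only prompt from image captions and synthetic question-answer pairs, then appends the real question for a frozen LLM to complete. The function name `build_prompt`, the "Contexts / Question / Answer" template, and the sample captions and exemplars are all assumptions made for this example; the abstract does not specify the actual prompt format or the models that produce the captions and question-answer pairs.

```python
from typing import List, Tuple


def build_prompt(
    captions: List[str],
    qa_pairs: List[Tuple[str, str]],
    question: str,
) -> str:
    """Assemble a text-only VQA prompt (hypothetical template, not the paper's)."""
    # Captions stand in for the image, bridging the modality disconnection.
    context = "Contexts: " + " ".join(c.strip() for c in captions)
    # Synthetic QA pairs act as in-context exemplars, bridging the task disconnection.
    exemplars = "\n".join(f"Question: {q} Answer: {a}" for q, a in qa_pairs)
    # The real question is left open for the frozen LLM to complete.
    return f"{context}\n{exemplars}\nQuestion: {question} Answer:"


# Hypothetical inputs; in practice these would come from LLM-agnostic models.
captions = ["A dog runs on the beach.", "Waves break behind a brown dog."]
qa_pairs = [
    ("Where is the dog?", "beach"),
    ("What color is the dog?", "brown"),
]
prompt = build_prompt(captions, qa_pairs, "What animal is in the picture?")
print(prompt)
```

The resulting string can be passed to any text-completion LLM without changing its weights, which is the sense in which such a module is plug-and-play.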

updated: Sat Mar 04 2023 12:42:54 GMT+0000 (UTC)

published: Wed Dec 21 2022 08:39:36 GMT+0000 (UTC)

arXiv
