AVIS: Autonomous Visual Information Seeking with Large Language Models

Ziniu Hu; Ahmet Iscen; Chen Sun; Kai-Wei Chang; Yizhou Sun; David A Ross; Cordelia Schmid; Alireza Fathi

In this paper, we propose AVIS, an autonomous information-seeking framework for visual question answering. Our method uses a Large Language Model (LLM) to dynamically plan the use of external tools and to examine their outputs, thereby acquiring the knowledge needed to answer the posed question. Answering visual questions that require external knowledge, such as "What event is commemorated by the building depicted in this image?", is a complex task. It presents a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions. We conduct a user study to collect a variety of instances of human decision-making when faced with this task. This data is then used to design a system composed of three components: an LLM-powered planner that dynamically determines which tool to use next, an LLM-powered reasoner that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process. The collected user behavior guides our system in two key ways. First, we create a transition graph by analyzing the sequence of decisions made by users. This graph delineates distinct states and restricts the set of actions available at each state. Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant in-context examples, enhancing their capacity to make informed decisions. We show that AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
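The abstract describes the control flow only at a high level, so the following is a minimal sketch of how a planner, a reasoner, a working memory, and a transition graph could fit together. Everything named here is an assumption for illustration: the states, the tool names, the `run_loop` function, and the stub planner and reasoner (plain Python functions standing in for LLM calls) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical transition graph: current state -> actions allowed next.
# The states and tool names are illustrative, not the paper's actual graph.
TRANSITIONS: Dict[str, List[str]] = {
    "start": ["object_detection", "image_search"],
    "object_detection": ["image_search", "web_search"],
    "image_search": ["web_search", "answer"],
    "web_search": ["answer"],
}


@dataclass
class WorkingMemory:
    """Keeps the facts extracted so far for the whole run."""
    facts: List[str] = field(default_factory=list)


def run_loop(
    question: str,
    planner: Callable[[str, WorkingMemory, List[str]], str],
    reasoner: Callable[[str, str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
):
    """Plan, call a tool, extract a fact, and repeat until 'answer' is chosen."""
    memory = WorkingMemory()
    state = "start"
    trace: List[str] = []
    for _ in range(max_steps):
        allowed = TRANSITIONS[state]
        action = planner(question, memory, allowed)
        if action not in allowed:
            # The graph restricts the planner to actions valid in this state.
            raise ValueError(f"{action!r} is not allowed in state {state!r}")
        trace.append(action)
        if action == "answer":
            break
        output = tools[action](question)
        fact = reasoner(question, output)
        if fact:
            memory.facts.append(fact)
        state = action
    return memory, trace


# Stub planner (stands in for an LLM call): answer as soon as that is
# allowed and something is known, otherwise take the first allowed tool.
def stub_planner(question: str, memory: WorkingMemory, allowed: List[str]) -> str:
    if "answer" in allowed and memory.facts:
        return "answer"
    return allowed[0]


# Stub reasoner (stands in for an LLM call): keep the tool output as a fact.
def stub_reasoner(question: str, tool_output: str) -> str:
    return tool_output.strip()


# Stub tools returning canned strings instead of calling real APIs.
stub_tools = {
    "object_detection": lambda q: "detected: building",
    "image_search": lambda q: "caption: a memorial building",
    "web_search": lambda q: "snippet: placeholder text",
}

memory, trace = run_loop(
    "What event is commemorated by the building depicted in this image?",
    stub_planner,
    stub_reasoner,
    stub_tools,
)
print(trace)
print(memory.facts)
```

In this sketch the transition graph is the only thing that constrains the planner; the in-context examples from the user study, which the abstract also mentions, would live inside the planner and reasoner prompts and are not modeled here.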

updated: Tue Jun 13 2023 20:50:22 GMT+0000 (UTC)

published: Tue Jun 13 2023 20:50:22 GMT+0000 (UTC)

arXiv
