ViperGPT: Visual Inference via Python Execution for Reasoning

Dídac Surís; Sachit Menon; Carl Vondrick

Answering visual queries is a complex task that requires both visual processing and reasoning. End-to-end models, the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and generalization. Learning modular programs presents a promising alternative, but has proven challenging due to the difficulty of learning both the programs and modules simultaneously. We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines to produce a result for any query. ViperGPT utilizes a provided API to access the available modules, and composes them by generating Python code that is later executed. This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.
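
The pipeline described in the abstract (a code-generation model reads an API, writes a Python program that composes the available modules, and that program is executed afterwards) can be illustrated with a toy sketch. Everything named here is an assumption made for illustration and not the authors' implementation: `ToyImage`, `find`, `API_DOC`, `toy_code_generator`, `execute_command`, and `answer` are hypothetical, the "code generator" returns a canned program instead of calling a real model, and the "detector" is a plain list lookup rather than a pretrained vision model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyImage:
    """Hypothetical stand-in for an image: just the labels a detector would emit."""
    objects: List[str]


def find(image: ToyImage, name: str) -> List[str]:
    """Stub perception module: return every object in the image labelled `name`."""
    return [obj for obj in image.objects if obj == name]


# Hypothetical API description that would be shown to the code-generation model.
API_DOC = "find(image, name) -> list of objects in the image matching name"


def toy_code_generator(api_doc: str, query: str) -> str:
    """Stand-in for a code-generation model.

    A real system would prompt a model with `api_doc` and `query`;
    this stub ignores both and returns a fixed program.
    """
    return (
        "def execute_command(image):\n"
        "    muffins = find(image, 'muffin')\n"
        "    kids = find(image, 'kid')\n"
        "    return len(muffins) // len(kids)\n"
    )


def answer(image: ToyImage, query: str) -> int:
    """Generate a program for the query, then execute it on the image."""
    source = toy_code_generator(API_DOC, query)
    namespace = {"find": find}  # only the API functions are exposed to the program
    exec(source, namespace)     # defines execute_command inside `namespace`
    return namespace["execute_command"](image)


image = ToyImage(objects=["muffin"] * 8 + ["kid"] * 2)
result = answer(image, "How many muffins can each kid have for it to be fair?")
print(result)  # prints 4
```

In this sketch the reasoning (counting and integer division) lives in ordinary Python, while perception is delegated to the `find` module, which mirrors the separation of visual processing from reasoning that the abstract argues for. Running generated code with `exec` is unsafe on untrusted output, so a real deployment would need sandboxing.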

updated: Tue Mar 14 2023 17:57:47 GMT+0000 (UTC)

published: Tue Mar 14 2023 17:57:47 GMT+0000 (UTC)

arXiv
