AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

Difei Gao; Lei Ji; Luowei Zhou; Kevin Qinghong Lin; Joya Chen; Zihan Fan; Mike Zheng Shou

Recent research on Large Language Models (LLMs) has led to remarkable advancements in general NLP AI assistants. Some studies have further explored the use of LLMs for planning and invoking models or APIs to address more general multi-modal user queries. Despite this progress, complex visual tasks remain challenging because visual tasks are so diverse. This diversity is reflected in two aspects: 1) Reasoning paths. For many real-life applications, it is hard to accurately decompose a query simply by examining the query itself; planning based on the specific visual content and the results of each step is usually required. 2) Flexible inputs and intermediate results. In-the-wild inputs are flexible in form and may involve not only a single image or video but also a mixture of videos and images, e.g., a user-view image with some reference videos. In addition, a complex reasoning process generates diverse multi-modal intermediate results, e.g., video narrations and segmented video clips. To address such general cases, we propose a multi-modal AI assistant, AssistGPT, with an interleaved code and language reasoning approach called Plan, Execute, Inspect, and Learn (PEIL) to integrate LLMs with various tools. Specifically, the Planner uses natural language to plan which tool in the Executor should run next, based on the current reasoning progress. The Inspector is an efficient memory manager that helps the Planner feed the proper visual information into a specific tool. Finally, since the entire reasoning process is complex and flexible, a Learner is designed to enable the model to autonomously explore and discover the optimal solution. We conducted experiments on the A-OKVQA and NExT-QA benchmarks, achieving state-of-the-art results. Moreover, showcases demonstrate the ability of our system to handle questions far more complex than those found in the benchmarks.
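The Planner, Executor, and Inspector roles described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Inspector`, `toy_planner`, `caption_tool`, `answer_tool`, `run`) is hypothetical, the planner is a hard-coded stand-in for an LLM, the tools return placeholder strings instead of calling real models, and the Learner component is omitted entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Inspector:
    """Toy memory manager: stores inputs and intermediate results by name."""
    items: Dict[str, str] = field(default_factory=dict)   # name -> content
    summaries: List[str] = field(default_factory=list)    # short records shown to the planner

    def register(self, name: str, content: str, kind: str) -> None:
        self.items[name] = content
        self.summaries.append(f"{name}: {kind}")

    def fetch(self, name: str) -> str:
        return self.items[name]


# Hypothetical tools; real ones would call captioning or QA models.
def caption_tool(content: str, query: str) -> str:
    return f"caption of [{content}]"


def answer_tool(content: str, query: str) -> str:
    return f"answer to '{query}' given [{content}]"


TOOLS: Dict[str, Callable[[str, str], str]] = {
    "caption": caption_tool,
    "answer": answer_tool,
}


def toy_planner(query: str, summaries: List[str]) -> Optional[Tuple[str, str]]:
    """Stand-in for an LLM planner: pick the next (tool, input name), or None when done."""
    names = [s.split(":")[0] for s in summaries]
    if "caption_0" not in names:
        return ("caption", "image_0")
    if "answer_0" not in names:
        return ("answer", "caption_0")
    return None


def run(query: str, inspector: Inspector, max_steps: int = 5) -> str:
    """Interleave planning and execution; each result is registered with the inspector."""
    for _ in range(max_steps):
        step = toy_planner(query, inspector.summaries)
        if step is None:
            break
        tool_name, input_name = step
        result = TOOLS[tool_name](inspector.fetch(input_name), query)
        inspector.register(f"{tool_name}_0", result, kind="text")
    return inspector.items.get("answer_0", "")


inspector = Inspector()
inspector.register("image_0", "a dog on a beach", kind="image")
final_answer = run("What animal is shown?", inspector)
```

The point of the sketch is that the plan is not fixed up front: the planner is consulted again after every tool call and sees the inspector's summaries of what has been produced so far, which mirrors the abstract's claim that planning depends on the results of each step.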

updated: Wed Jun 28 2023 05:00:35 GMT+0000 (UTC)

published: Wed Jun 14 2023 17:12:56 GMT+0000 (UTC)

arXiv
