NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models

Gengze Zhou; Yicong Hong; Qi Wu

Trained with an unprecedented scale of data, large language models (LLMs) like ChatGPT and GPT-4 exhibit the emergence of significant reasoning abilities from model scaling. This trend underscores the potential of training LLMs with unlimited language data, advancing the development of a universal embodied agent. In this work, we introduce NavGPT, a purely LLM-based instruction-following navigation agent, to reveal the reasoning capability of GPT models in complex embodied scenes by performing zero-shot sequential action prediction for vision-and-language navigation (VLN). At each step, NavGPT takes textual descriptions of visual observations, navigation history, and future explorable directions as inputs, reasons about the agent's current status, and decides how to approach the target. Through comprehensive experiments, we demonstrate that NavGPT can explicitly perform high-level planning for navigation, including decomposing instructions into sub-goals, integrating commonsense knowledge relevant to navigation task resolution, identifying landmarks from observed scenes, tracking navigation progress, and adapting to exceptions with plan adjustment. Furthermore, we show that LLMs are capable of generating high-quality navigational instructions from observations and actions along a path, as well as drawing accurate top-down metric trajectories given the agent's navigation history. Although NavGPT's zero-shot performance on R2R tasks still falls short of trained models, we suggest adapting multi-modality inputs for LLMs so that they can serve as visual navigation agents, and applying the explicit reasoning of LLMs to benefit learning-based models.
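The per-step loop described in the abstract (textual observation, navigation history, and explorable directions in; reasoning and an action out) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: `StepInput`, `build_prompt`, `parse_action`, `navigate_step`, the `STOP` sentinel, and the "Thought / Action" reply format are hypothetical and are not taken from the paper's actual prompts or code. The `llm` argument is any callable that maps a prompt string to a reply string.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

STOP = -1  # hypothetical sentinel meaning "stop navigating"


@dataclass
class StepInput:
    instruction: str        # the natural-language navigation instruction
    observation: str        # textual description of the current view
    history: List[str]      # textual summaries of earlier steps
    candidates: List[str]   # textual descriptions of explorable directions


def build_prompt(step: StepInput) -> str:
    """Assemble one text prompt from the three textual inputs (assumed layout)."""
    history = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(step.history)) or "(none)"
    options = "\n".join(f"[{i}] {c}" for i, c in enumerate(step.candidates))
    return (
        f"Instruction: {step.instruction}\n"
        f"History:\n{history}\n"
        f"Current observation: {step.observation}\n"
        f"Explorable directions:\n{options}\n"
        "Reply with 'Thought: <reasoning>' then 'Action: <index or stop>'."
    )


def parse_action(reply: str, n_candidates: int) -> int:
    """Extract the chosen direction index; fall back to STOP if unparseable."""
    match = re.search(r"Action:\s*(stop|\d+)", reply, re.IGNORECASE)
    if match is None or match.group(1).lower() == "stop":
        return STOP
    index = int(match.group(1))
    return index if index < n_candidates else STOP


def navigate_step(step: StepInput, llm: Callable[[str], str]) -> Tuple[str, int]:
    """Run one zero-shot decision step: prompt the LLM, keep its reasoning, parse the action."""
    reply = llm(build_prompt(step))
    return reply, parse_action(reply, len(step.candidates))
```

In this sketch the raw reply is returned alongside the parsed action so that the model's explicit reasoning stays inspectable, and an out-of-range or missing action is treated as a stop rather than guessed.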

updated: Mon May 29 2023 04:49:00 GMT+0000 (UTC)

published: Fri May 26 2023 14:41:06 GMT+0000 (UTC)

arXiv
