JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents

Kaizhi Zheng; Kaiwen Zhou; Jing Gu; Yue Fan; Jialu Wang; Zonglin Di; Xuehai He; Xin Eric Wang

Building a conversational embodied agent to execute real-life tasks has been a long-standing yet quite challenging research goal, as it requires effective human-agent communication, multi-modal understanding, long-range sequential decision making, etc. Traditional symbolic methods have scaling and generalization issues, while end-to-end deep learning models suffer from data scarcity and high task complexity, and are often hard to explain. To benefit from both worlds, we propose a Neuro-Symbolic Commonsense Reasoning (JARVIS) framework for modular, generalizable, and interpretable conversational embodied agents. First, it acquires symbolic representations by prompting large language models (LLMs) for language understanding and sub-goal planning, and by constructing semantic maps from visual observations. Then the symbolic module reasons for sub-goal planning and action generation based on task- and action-level common sense. Extensive experiments on the TEACh dataset validate the efficacy and efficiency of our JARVIS framework, which achieves state-of-the-art (SOTA) results on all three dialog-based embodied tasks, including Execution from Dialog History (EDH), Trajectory from Dialog (TfD), and Two-Agent Task Completion (TATC) (e.g., our method boosts the unseen Success Rate on EDH from 6.1% to 15.8%). Moreover, we systematically analyze the essential factors that affect the task performance and also demonstrate the superiority of our method in few-shot settings. Our JARVIS model ranks first in the Alexa Prize SimBot Public Benchmark Challenge.
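The two-stage flow described in the abstract (a neural planner proposes sub-goals, then a symbolic module checks them against common sense and grounds them in a semantic map) can be illustrated with a toy sketch. Everything below is a hypothetical stand-in and not the authors' implementation: `plan_subgoals` replaces the prompted LLM with a keyword lookup, the "must hold an object before placing it" rule is just one assumed example of task-level common sense, and the semantic map is reduced to a dictionary from object names to grid cells.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SubGoal:
    """Hypothetical symbolic sub-goal: an action on a target object."""
    action: str                 # e.g. "Pickup", "Place", "ToggleOn"
    target: str                 # object the action is applied to
    obj: Optional[str] = None   # for "Place": the object that must be in hand


def plan_subgoals(dialog: str) -> list[SubGoal]:
    """Stand-in for the LLM planner (assumption: keyword lookup, no model).

    The returned plan deliberately omits the pickup step so that the
    symbolic repair below has something to fix.
    """
    if "coffee" in dialog.lower():
        return [
            SubGoal("Place", "CoffeeMachine", obj="Mug"),
            SubGoal("ToggleOn", "CoffeeMachine"),
        ]
    return []


def repair_plan(plan: list[SubGoal]) -> list[SubGoal]:
    """Apply one assumed task-level rule: Place requires holding the object."""
    holding: Optional[str] = None
    fixed: list[SubGoal] = []
    for goal in plan:
        if goal.action == "Place" and holding != goal.obj:
            # Precondition violated: insert the missing pickup first.
            fixed.append(SubGoal("Pickup", goal.obj))
            holding = goal.obj
        if goal.action == "Pickup":
            holding = goal.target
        elif goal.action == "Place":
            holding = None
        fixed.append(goal)
    return fixed


def ground(plan: list[SubGoal], semantic_map: dict[str, tuple[int, int]]) -> list[str]:
    """Turn sub-goals into primitive actions using known object locations."""
    actions: list[str] = []
    for goal in plan:
        if goal.target in semantic_map:
            x, y = semantic_map[goal.target]
            actions.append(f"Goto({x},{y})")
        else:
            # Object not observed yet: fall back to exploration.
            actions.append(f"Search({goal.target})")
        actions.append(f"{goal.action}({goal.target})")
    return actions


if __name__ == "__main__":
    plan = repair_plan(plan_subgoals("Please make me a coffee"))
    for step in ground(plan, {"Mug": (2, 3), "CoffeeMachine": (5, 1)}):
        print(step)
```

The point of the sketch is the division of labor: the planner output is treated as a proposal in symbolic form, so missing steps can be detected and repaired by explicit rules, and each resulting action remains inspectable.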

updated: Tue Aug 30 2022 02:10:50 GMT+0000 (UTC)

published: Sun Aug 28 2022 18:30:46 GMT+0000 (UTC)

arXiv
