Habitat-Web: Learning Embodied Object-Search Strategies from Human Demonstrations at Scale

Ram Ramrakhya; Eric Undersander; Dhruv Batra; Abhishek Das

We present a large-scale study of imitating human demonstrations on tasks that require a virtual robot to search for objects in new environments: (1) ObjectGoal Navigation (e.g., 'find & go to a chair') and (2) Pick&Place (e.g., 'find mug, pick mug, find counter, place mug on counter'). First, we develop a virtual teleoperation data-collection infrastructure that connects the Habitat simulator running in a web browser to Amazon Mechanical Turk, allowing remote users to teleoperate virtual robots safely and at scale. We collect 80k demonstrations for ObjectNav and 12k demonstrations for Pick&Place, which is an order of magnitude larger than existing human demonstration datasets in simulation or on real robots. Second, we attempt to answer the question: how does large-scale imitation learning (IL), which has not hitherto been possible, compare to reinforcement learning (RL), which is the status quo? On ObjectNav, we find that IL (with no bells or whistles) using 70k human demonstrations outperforms RL using 240k agent-gathered trajectories. The IL-trained agent demonstrates efficient object-search behavior: it peeks into rooms, checks corners for small objects, and turns in place to get a panoramic view. None of these behaviors is exhibited as prominently by the RL agent, and inducing them via RL would require tedious reward engineering. Finally, accuracy vs. training data size plots show promising scaling behavior, suggesting that simply collecting more demonstrations is likely to advance the state of the art further. On Pick&Place, the comparison is starker: IL agents achieve ~18% success on episodes with new object-receptacle locations when trained with 9.5k human demonstrations, while RL agents fail to get beyond 0%. Overall, our work provides compelling evidence for investing in large-scale imitation learning. Project page: https://ram81.github.io/projects/habitat-web.

updated: Fri Apr 08 2022 14:37:32 GMT+0000 (UTC)

published: Thu Apr 07 2022 15:31:43 GMT+0000 (UTC)

arXiv
