KITE: Keypoint-Conditioned Policies for Semantic Manipulation

Priya Sundaresan; Suneel Belkhale; Dorsa Sadigh; Jeannette Bohg

While natural language offers a convenient shared interface for humans and robots, enabling robots to interpret and follow language commands remains a longstanding challenge in manipulation. A crucial step to realizing a performant instruction-following robot is achieving semantic manipulation, where a robot interprets language at different specificities, from high-level instructions like "Pick up the stuffed animal" to more detailed inputs like "Grab the left ear of the elephant." To tackle this, we propose Keypoints + Instructions to Execution (KITE), a two-step framework for semantic manipulation which attends to both scene semantics (distinguishing between different objects in a visual scene) and object semantics (precisely localizing different parts within an object instance). KITE first grounds an input instruction in a visual scene through 2D image keypoints, providing a highly accurate object-centric bias for downstream action inference. Given an RGB-D scene observation, KITE then executes a learned keypoint-conditioned skill to carry out the instruction. The combined precision of keypoints and parameterized skills enables fine-grained manipulation with generalization to scene and object variations. Empirically, we demonstrate KITE in three real-world environments: long-horizon 6-DoF tabletop manipulation, semantic grasping, and a high-precision coffee-making task. In these settings, KITE achieves overall instruction-following success rates of 75%, 70%, and 71%, respectively. KITE outperforms frameworks that opt for pre-trained visual language models over keypoint-based grounding, or omit skills in favor of end-to-end visuomotor control, all while being trained from fewer or comparable amounts of demonstrations. Supplementary material, datasets, code, and videos can be found on our website: http://tinyurl.com/kite-site.
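As a rough illustration of the two-step structure described in the abstract, the sketch below grounds an instruction to a 2D keypoint and then passes a 3D target to a keypoint-conditioned skill. It is not the authors' implementation. The names `Intrinsics`, `deproject`, `run_step`, `ground`, and `skill` are hypothetical, and lifting the keypoint to 3D with a pinhole camera model and the depth image is an assumption made here for illustration; the abstract only states that an RGB-D observation is provided.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Pixel = Tuple[int, int]                # (u, v) = (column, row) in the image
Point3D = Tuple[float, float, float]   # (x, y, z) in the camera frame, meters


@dataclass
class Intrinsics:
    """Pinhole camera parameters (assumed; not specified in the abstract)."""
    fx: float
    fy: float
    cx: float
    cy: float


def deproject(u: int, v: int, depth: Sequence[Sequence[float]], k: Intrinsics) -> Point3D:
    """Lift pixel (u, v) to a 3D camera-frame point using its depth value."""
    z = depth[v][u]  # depth image is indexed as [row][column]
    if z <= 0:
        raise ValueError("no valid depth at the predicted keypoint")
    x = (u - k.cx) * z / k.fx
    y = (v - k.cy) * z / k.fy
    return (x, y, z)


def run_step(
    instruction: str,
    rgb: object,
    depth: Sequence[Sequence[float]],
    k: Intrinsics,
    ground: Callable[[str, object], Pixel],
    skill: Callable[[Point3D], List[str]],
) -> List[str]:
    """Step 1: ground the instruction to a 2D keypoint.
    Step 2: run a keypoint-conditioned skill on the lifted 3D target."""
    u, v = ground(instruction, rgb)        # hypothetical grounding model
    target = deproject(u, v, depth, k)     # assumed 2D -> 3D lifting
    return skill(target)                   # hypothetical learned skill


# Stand-ins for the learned components, for demonstration only.
def fake_ground(instruction: str, rgb: object) -> Pixel:
    return (2, 1)


def fake_grasp(target: Point3D) -> List[str]:
    return [f"move_to{target}", "close_gripper"]


depth_image = [[0.5, 0.5, 0.5] for _ in range(3)]
camera = Intrinsics(fx=100.0, fy=100.0, cx=1.0, cy=1.0)
actions = run_step("Grab the left ear of the elephant", None,
                   depth_image, camera, fake_ground, fake_grasp)
```

Keeping grounding and skill execution as separate callables mirrors the split the abstract describes: the keypoint is the only interface between the language-conditioned stage and the action stage.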

updated: Thu Jul 06 2023 11:02:52 GMT+0000 (UTC)

published: Thu Jun 29 2023 00:12:21 GMT+0000 (UTC)

arXiv
