R3M: A Universal Visual Representation for Robot Manipulation

Suraj Nair; Aravind Rajeswaran; Vikash Kumar; Chelsea Finn; Abhinav Gupta

We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation on the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty that encourages sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations. Code and pre-trained models are available at https://tinyurl.com/robotr3m.
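To make the pre-training objective more concrete, the following is a minimal sketch of how a time-contrastive term and an L1 sparsity penalty could be combined on frame embeddings. It is not the authors' implementation: the function names, the negative-L2 similarity, the InfoNCE-style form of the contrastive term, and the weights `tcn_weight` and `l1_weight` are all assumptions made for illustration. The video-language alignment term is omitted because it depends on a learned scoring model that the abstract does not describe.

```python
import math


def neg_l2(a, b):
    """Similarity as negative Euclidean distance (an assumed choice)."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def time_contrastive_loss(anchor, positive, negatives):
    """InfoNCE-style loss: the temporally close frame (positive) should be
    more similar to the anchor than temporally distant frames (negatives)."""
    pos = neg_l2(anchor, positive)
    scores = [pos] + [neg_l2(anchor, n) for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(scores)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - pos


def l1_penalty(embedding):
    """Sum of absolute values; pushes embedding entries toward zero."""
    return sum(abs(x) for x in embedding)


def combined_loss(anchor, positive, negatives, tcn_weight=1.0, l1_weight=1e-5):
    """Weighted sum of the two terms; both weights are hypothetical."""
    tcn = time_contrastive_loss(anchor, positive, negatives)
    return tcn_weight * tcn + l1_weight * l1_penalty(anchor)


if __name__ == "__main__":
    anchor = [0.0, 0.0]
    near_frame = [0.1, 0.0]   # embedding of a frame close in time
    far_frame = [3.0, 4.0]    # embedding of a frame far away in time
    print(combined_loss(anchor, near_frame, [far_frame]))
```

In this sketch the loss is small when the temporally close frame lies nearer to the anchor than the distant frame, and large when the ordering is reversed. In actual training the embeddings would come from the visual encoder and the loss would be averaged over batches of video clips.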

updated: Mon Apr 18 2022 22:39:13 GMT+0000 (UTC)

published: Wed Mar 23 2022 17:55:09 GMT+0000 (UTC)

arXiv
