Robot Learning with Sensorimotor Pre-training

Ilija Radosavovic; Baifeng Shi; Letian Fu; Ken Goldberg; Trevor Darrell; Jitendra Malik

We present a self-supervised sensorimotor pre-training approach for robotics. Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens. Given a sequence of camera images, proprioceptive robot states, and past actions, we encode the interleaved sequence into tokens, mask out a random subset, and train a model to predict the masked-out content. We hypothesize that if the robot can predict the missing content, it has acquired a good model of the physical world that can enable it to act. RPT is designed to operate on latent visual representations, which makes prediction tractable and enables both scaling to 10x larger models and 10 Hz inference on a real robot. To evaluate our approach, we collect a dataset of 20,000 real-world trajectories over 9 months using a combination of motion planning and model-based grasping algorithms. We find that pre-training on this data consistently outperforms training from scratch, leads to 2x improvements in the block stacking task, and has favorable scaling properties.
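
The masking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`interleave`, `mask_tokens`, `masked_mse`), the per-timestep token order, the mask ratio, and the squared-error loss are all assumptions made for illustration, and short lists of floats stand in for latent image features, proprioceptive states, and actions. No Transformer is included; the sketch only shows how an interleaved sequence might be built, masked, and scored on the masked positions.

```python
import random


def interleave(images, states, actions):
    """Flatten per-timestep (image, state, action) triples into one token sequence.

    Each token is (modality, timestep, vector). The ordering within a
    timestep is an assumption of this sketch.
    """
    if not (len(images) == len(states) == len(actions)):
        raise ValueError("all modalities must cover the same timesteps")
    tokens = []
    for t, (img, state, act) in enumerate(zip(images, states, actions)):
        tokens.append(("image", t, img))
        tokens.append(("state", t, state))
        tokens.append(("action", t, act))
    return tokens


def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of tokens.

    Returns the sequence with masked vectors replaced by None, plus a
    dict mapping each masked position to its original vector (the target).
    """
    n_masked = round(mask_ratio * len(tokens))
    masked_idx = set(rng.sample(range(len(tokens)), n_masked))
    visible = [
        (kind, t, None) if i in masked_idx else (kind, t, value)
        for i, (kind, t, value) in enumerate(tokens)
    ]
    targets = {i: tokens[i][2] for i in masked_idx}
    return visible, targets


def masked_mse(predictions, targets):
    """Mean squared error computed over masked positions only."""
    total, count = 0.0, 0
    for i, target in targets.items():
        for p, y in zip(predictions[i], target):
            total += (p - y) ** 2
            count += 1
    return total / count if count else 0.0


# Toy data: two timesteps, with 2-d vectors standing in for each modality.
images = [[0.1, 0.2], [0.3, 0.4]]    # stand-ins for latent visual features
states = [[0.0, 1.0], [0.5, 0.5]]    # stand-ins for proprioceptive states
actions = [[1.0, 0.0], [0.0, 1.0]]   # stand-ins for past actions

tokens = interleave(images, states, actions)
visible, targets = mask_tokens(tokens, mask_ratio=0.5, rng=random.Random(0))
```

In a real pipeline, a model would receive `visible` and produce a prediction for every masked position, and `masked_mse` (or another loss) would compare those predictions with `targets`.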

updated: Fri Jun 16 2023 17:58:10 GMT+0000 (UTC)

published: Fri Jun 16 2023 17:58:10 GMT+0000 (UTC)

arXiv
