Learning Reward Functions for Robotic Manipulation by Observing Humans

Minttu Alakuijala; Gabriel Dulac-Arnold; Julien Mairal; Jean Ponce; Cordelia Schmid

Observing a human demonstrator manipulate objects provides a rich, scalable and inexpensive source of data for learning robotic policies. However, transferring skills from human videos to a robotic manipulator poses several challenges, not least a difference in action and observation spaces. In this work, we use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies. Thanks to the diversity of this training data, the learned reward function generalizes well enough to image observations from a previously unseen robot embodiment and environment to provide a meaningful prior for directed exploration in reinforcement learning. We propose two methods for scoring states relative to a goal image: through direct temporal regression, and through distances in an embedding space obtained with time-contrastive learning. By conditioning the function on a goal image, we are able to reuse one model across a variety of tasks. Unlike prior work on leveraging human videos to teach robots, our method, Human Offline Learned Distances (HOLD), requires neither a priori data from the robot environment, nor a set of task-specific human demonstrations, nor a predefined notion of correspondence across morphologies. Even so, it accelerates training of several manipulation tasks on a simulated robot arm compared to using only a sparse reward obtained from task completion.
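The two goal-conditioned scoring ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `make_embed` is a hypothetical stand-in for a learned image encoder (here a fixed random linear map over flattened pixel lists), the reward is assumed to be the negative Euclidean distance between state and goal embeddings, and `regression_targets` assumes that the temporal regression target for a frame is the number of steps remaining until the final frame of its video, which is treated as the goal image.

```python
import math
import random


def make_embed(in_dim, out_dim, seed=0):
    """Hypothetical stand-in for a learned encoder: a fixed random linear map."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def embed(image):
        # Project a flattened image (list of floats) into the embedding space.
        return [sum(w * x for w, x in zip(row, image)) for row in weights]

    return embed


def distance_reward(embed, state_image, goal_image):
    """Assumed reward form: negative Euclidean distance between embeddings."""
    zs, zg = embed(state_image), embed(goal_image)
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(zs, zg)))


def regression_targets(num_frames):
    """Assumed regression labels: steps remaining until the last (goal) frame."""
    return [num_frames - 1 - t for t in range(num_frames)]


embed = make_embed(in_dim=4, out_dim=3)
goal = [1.0, 0.0, 0.0, 0.0]
state = [0.0, 1.0, 0.0, 0.0]
print(distance_reward(embed, goal, goal))   # highest possible reward, at the goal
print(distance_reward(embed, state, goal))  # strictly lower away from the goal
print(regression_targets(4))
```

Because the goal image is an argument rather than part of the model, the same scoring function can be reused across tasks by swapping in a different goal, which is the reuse property the abstract describes.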

updated: Tue Mar 07 2023 16:29:49 GMT+0000 (UTC)

published: Wed Nov 16 2022 16:26:48 GMT+0000 (UTC)

arXiv
