Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery

Kristian Hartikainen; Xinyang Geng; Tuomas Haarnoja; Sergey Levine

Reinforcement learning requires manual specification of a reward function to learn a task. While in principle this reward function only needs to specify the task goal, in practice reinforcement learning can be very time-consuming or even infeasible unless the reward function is shaped so as to provide a smooth gradient towards a successful outcome. This shaping is difficult to specify by hand, particularly when the task is learned from raw observations, such as images. In this paper, we study how we can automatically learn dynamical distances: a measure of the expected number of time steps to reach a given goal state from any other state. These dynamical distances can be used to provide well-shaped reward functions for reaching new goals, making it possible to learn complex tasks efficiently. We show that dynamical distances can be used in a semi-supervised regime, where unsupervised interaction with the environment is used to learn the dynamical distances, while a small amount of preference supervision is used to determine the task goal, without any manually engineered reward function or goal examples. We evaluate our method both on a real-world robot and in simulation. We show that our method can learn to turn a valve with a real-world 9-DoF hand, using raw image observations and just ten preference labels, without any other supervision. Videos of the learned skills can be found on the project website: https://sites.google.com/view/dynamical-distance-learning.
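
As a rough illustration of the idea in the abstract, the sketch below estimates dynamical distances from logged trajectories by averaging the number of time steps observed between a state and a later state, then uses the negative distance as a shaped reward. This is a tabular stand-in under stated assumptions, not the authors' implementation: it assumes small, hashable states, and the names `distance_labels`, `estimate_distances`, and `shaped_reward` are hypothetical. The paper works from raw image observations, where a learned function approximator would be needed instead of a lookup table.

```python
from collections import defaultdict


def distance_labels(trajectory):
    """Return (state, goal, steps) for every ordered pair in one trajectory.

    The step count j - i is the number of time steps it took to get from
    the state at index i to the state at index j in this trajectory.
    """
    labels = []
    for i, state in enumerate(trajectory):
        for j in range(i, len(trajectory)):
            labels.append((state, trajectory[j], j - i))
    return labels


def estimate_distances(trajectories):
    """Average the observed step counts for each (state, goal) pair.

    Assumption: states are hashable and few enough to keep in a table.
    """
    totals = defaultdict(lambda: [0, 0])  # (state, goal) -> [sum, count]
    for trajectory in trajectories:
        for state, goal, steps in distance_labels(trajectory):
            totals[(state, goal)][0] += steps
            totals[(state, goal)][1] += 1
    return {pair: total / count for pair, (total, count) in totals.items()}


def shaped_reward(distances, state, goal):
    """Negative estimated distance: states closer to the goal score higher."""
    return -distances[(state, goal)]


# Two toy trajectories over symbolic states (illustrative data only).
trajectories = [["a", "b", "c"], ["a", "c"]]
distances = estimate_distances(trajectories)
print(distances[("a", "c")])               # 1.5, the mean of 2 and 1 steps
print(shaped_reward(distances, "a", "c"))  # -1.5
```

Because the reward rises steadily as the estimated distance to the goal shrinks, it gives the kind of smooth signal towards success that the abstract says hand-specified goal rewards often lack.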

updated: Fri Feb 14 2020 10:16:54 GMT+0000 (UTC)

published: Thu Jul 18 2019 18:07:47 GMT+0000 (UTC)

arXiv
