Playful Interactions for Representation Learning

Sarah Young; Jyothish Pari; Pieter Abbeel; Lerrel Pinto

One of the key challenges in visual imitation learning is collecting large numbers of expert demonstrations for a given task. Although teleoperation and low-cost assistive tools are making human demonstrations easier to collect, we often still require 100-1000 demonstrations per task to learn a visual representation and policy. To address this, we turn to an alternate form of data that does not require task-specific demonstrations: play. Play is a fundamental way in which children learn skills, behaviors, and visual representations early in life. Importantly, play data is diverse, task-agnostic, and relatively cheap to obtain. In this work, we propose to use playful interactions in a self-supervised manner to learn visual representations for downstream tasks. We collect 2 hours of playful data in 19 diverse environments and use self-predictive learning to extract visual representations. Given these representations, we train policies using imitation learning for two downstream tasks: Pushing and Stacking. We demonstrate that our visual representations generalize better than standard behavior cloning and can achieve similar performance with only half the number of required demonstrations. Our representations, which are trained from scratch, compare favorably against ImageNet pretrained representations. Finally, we provide an experimental analysis on the effects of different pretraining modes on downstream task learning.
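
The two-stage idea in the abstract (first obtain a representation, then train a policy on top of it by imitation) can be illustrated with a toy sketch. This is not the authors' implementation: `frozen_encoder` is a hypothetical stand-in for a pretrained visual encoder, the observations are 2-D points rather than images, the policy head is linear, and the "expert" data is synthetic. It only shows the behavior cloning step, in which the encoder stays fixed and the policy head is fit to expert actions by minimizing squared error.

```python
import random

def frozen_encoder(observation):
    """Hypothetical stand-in for a pretrained, frozen visual encoder.
    Maps a raw observation (x, y) to a feature vector with a bias term."""
    x, y = observation
    return [x, y, 1.0]

def predict(weights, features):
    """Linear policy head: one output per weight row."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def behavior_cloning(demos, n_actions, lr=0.1, epochs=500):
    """Fit the policy head with per-sample gradient steps on squared error."""
    n_features = len(frozen_encoder(demos[0][0]))
    weights = [[0.0] * n_features for _ in range(n_actions)]
    for _ in range(epochs):
        for obs, expert_action in demos:
            feats = frozen_encoder(obs)  # encoder parameters are never updated
            pred = predict(weights, feats)
            for a in range(n_actions):
                err = pred[a] - expert_action[a]
                for j in range(n_features):
                    weights[a][j] -= lr * err * feats[j]
    return weights

def make_demos(n, seed=0):
    """Synthetic expert (an assumption): action = -0.5 * observation."""
    rng = random.Random(seed)
    demos = []
    for _ in range(n):
        obs = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        demos.append((obs, [-0.5 * obs[0], -0.5 * obs[1]]))
    return demos

demos = make_demos(50)
policy = behavior_cloning(demos, n_actions=2)

# Query the cloned policy on an unseen observation; the synthetic expert
# would output [-0.2, 0.1] here, and the fitted head should be close to it.
test_obs = (0.4, -0.2)
action = predict(policy, frozen_encoder(test_obs))
print(action)
```

In this sketch the number of demonstrations is the argument to `make_demos`; with a real encoder and image observations, a better representation would be expected to reduce how many demonstrations the policy head needs, which is the effect the abstract reports.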

updated: Mon Jul 19 2021 17:54:48 GMT+0000 (UTC)

published: Mon Jul 19 2021 17:54:48 GMT+0000 (UTC)

arXiv
