A Control-Centric Benchmark for Video Prediction

Stephen Tian; Chelsea Finn; Jiajun Wu

Video is a promising source of knowledge for embodied agents to learn models of the world's dynamics. Large deep networks have become increasingly effective at modeling complex video data in a self-supervised manner, as evaluated by metrics based on human perceptual similarity or pixel-wise comparison. However, it remains unclear whether current metrics are accurate indicators of performance on downstream tasks. We find empirically that for planning robotic manipulation, existing metrics can be unreliable at predicting execution success. To address this, we propose a benchmark for action-conditioned video prediction in the form of a control benchmark that evaluates a given model for simulated robotic manipulation through sampling-based planning. Our benchmark, Video Prediction for Visual Planning (VP^2), includes simulated environments with 11 task categories and 310 task instance definitions, a full planning implementation, and training datasets containing scripted interaction trajectories for each task category. A central design goal of our benchmark is to expose a simple interface (a single forward prediction call) so that it is straightforward to evaluate almost any action-conditioned video prediction model. We then use our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling by analyzing five high-performing video prediction models, finding that while scale can improve perceptual quality when modeling visually diverse settings, other attributes such as uncertainty awareness can also aid planning performance.
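To make the idea of evaluating a model through sampling-based planning behind a single forward prediction call more concrete, the following is a minimal sketch and not the VP^2 implementation. It assumes a hypothetical `predict(context, actions)` function, a toy two-dimensional "frame" in place of an image, a squared-error cost against a goal frame, and plain random shooting as the sampler; all names and parameters here are illustrative assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple

Frame = List[float]  # toy stand-in for an image; a real model would use pixels
Action = List[float]


def toy_predict(context: Frame, actions: Sequence[Action]) -> List[Frame]:
    """Hypothetical single forward prediction call.

    Takes a context frame and a sequence of actions, returns one predicted
    frame per action. The toy dynamics simply add each action to the state.
    """
    frames: List[Frame] = []
    state = list(context)
    for action in actions:
        state = [s + a for s, a in zip(state, action)]
        frames.append(list(state))
    return frames


def squared_error(frame: Frame, goal: Frame) -> float:
    """Cost between a predicted frame and the goal frame (assumed metric)."""
    return sum((f - g) ** 2 for f, g in zip(frame, goal))


def plan_random_shooting(
    predict: Callable[[Frame, Sequence[Action]], List[Frame]],
    context: Frame,
    goal: Frame,
    horizon: int = 5,
    num_samples: int = 200,
    seed: int = 0,
) -> Tuple[List[Action], float]:
    """Sample action sequences, score each by the predicted final frame,
    and return the lowest-cost sequence together with its cost."""
    rng = random.Random(seed)
    best_actions: List[Action] = []
    best_cost = float("inf")
    for _ in range(num_samples):
        actions = [
            [rng.uniform(-1.0, 1.0) for _ in context] for _ in range(horizon)
        ]
        final_frame = predict(context, actions)[-1]
        cost = squared_error(final_frame, goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


if __name__ == "__main__":
    actions, cost = plan_random_shooting(toy_predict, [0.0, 0.0], [1.0, -1.0])
    print(f"best predicted cost: {cost:.4f} over {len(actions)} steps")
```

Because the planner only touches the model through `predict`, any action-conditioned predictor with the same call shape could be swapped in for `toy_predict` in this sketch without changing the planning code.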

updated: Wed Apr 26 2023 17:59:45 GMT+0000 (UTC)

published: Wed Apr 26 2023 17:59:45 GMT+0000 (UTC)

arXiv
