Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction

Bohan Wu; Suraj Nair; Roberto Martin-Martin; Li Fei-Fei; Chelsea Finn

大規模ビデオ予測のための貪欲な階層型変分オートエンコーダ

さまざまなシーンに一般化するビデオ予測モデルにより、ロボットなどのインテリジェントエージェントは、モデルを使用した計画を通じてさまざまなタスクを実行できます。ただし、既存のビデオ予測モデルは小さなデータセットで有望な結果を生み出しましたが、大規模で多様なデータセットでトレーニングすると、深刻なアンダーフィッティングに悩まされます。この不十分な課題に対処するために、最初に、より大きなビデオ予測モデルをトレーニングする機能が、GPUまたはTPUのメモリ制約によってボトルネックになっていることがよくあることを確認します。並行して、深い階層の潜在変数モデルは、将来の観測のマルチレベルの確率論をキャプチャすることにより、より高品質の予測を生成できますが、そのようなモデルのエンドツーエンドの最適化は特に困難です。私たちの重要な洞察は、階層型オートエンコーダの貪欲でモジュール式の最適化が、メモリの制約と大規模なビデオ予測の最適化の課題の両方に同時に対処できることです。階層的オートエンコーダーの各レベルを貪欲にトレーニングすることにより、忠実度の高いビデオ予測を学習する方法である、貪欲な階層的変分オートエンコーダー（GHVAE）を紹介します。最先端のモデルと比較して、GHVAEは4つのビデオデータセットで予測パフォーマンスを17〜55％向上させ、実際のロボットタスクで成功率を35〜40％向上させ、モジュールを追加するだけでパフォーマンスを単調に向上させることができます。。

A video prediction model that generalizes to diverse scenes would enable intelligent agents such as robots to perform a variety of tasks via planning with the model. However, while existing video prediction models have produced promising results on small datasets, they suffer from severe underfitting when trained on large and diverse datasets. To address this underfitting challenge, we first observe that the ability to train larger video prediction models is often bottlenecked by the memory constraints of GPUs or TPUs. In parallel, deep hierarchical latent variable models can produce higher quality predictions by capturing the multi-level stochasticity of future observations, but end-to-end optimization of such models is notably difficult. Our key insight is that greedy and modular optimization of hierarchical autoencoders can simultaneously address both the memory constraints and the optimization challenges of large-scale video prediction. We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder. In comparison to state-of-the-art models, GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.

updated: Sat Jun 19 2021 07:25:28 GMT+0000 (UTC)

published: Sat Mar 06 2021 18:58:56 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト