Video Prediction at Multiple Scales with Hierarchical Recurrent Networks

Ani Karapetyan; Angel Villar-Corrales; Andreas Boltres; Sven Behnke

Autonomous systems not only need to understand their current environment, but should also be able to predict future actions conditioned on past states, for instance, based on captured camera frames. Some tasks require detailed predictions of the near future, such as future video frames, whereas for others it is beneficial to also predict more abstract representations over longer time horizons. However, existing video prediction models mainly focus on forecasting detailed possible outcomes over short time horizons and are therefore of limited use for robot perception and spatial reasoning. We propose Multi-Scale Hierarchical Prediction (MSPred), a novel video prediction model that simultaneously forecasts possible future outcomes at different levels of granularity and at different time scales. By combining spatial and temporal downsampling, MSPred efficiently predicts abstract representations, such as human poses or object locations, over long time horizons while maintaining competitive performance for video frame prediction. In our experiments, we demonstrate that the proposed model accurately predicts future video frames as well as other representations (e.g., keypoints or positions) in various scenarios, including bin-picking scenes and action recognition datasets, consistently outperforming popular approaches for video frame prediction. Furthermore, we conduct an ablation study to investigate the importance of the different modules and design choices in MSPred. In the spirit of reproducible research, we open-source VP-Suite, a general framework for deep-learning-based video prediction, as well as pretrained models to reproduce our results.
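
The abstract does not say how MSPred implements its spatial and temporal downsampling, so the following is only a toy sketch of the general idea, not the authors' method. It assumes a hierarchy in which higher levels see a coarser frame (here, 2x2 average pooling) and update less often (here, every `period` steps). The function names and the periods `(1, 4, 8)` are hypothetical and chosen purely for illustration.

```python
from typing import List, Sequence

Frame = List[List[float]]


def avg_pool2x2(frame: Frame) -> Frame:
    """Spatial downsampling: average non-overlapping 2x2 blocks.

    A trailing odd row or column is dropped.
    """
    h, w = len(frame), len(frame[0])
    return [
        [
            (frame[r][c] + frame[r][c + 1]
             + frame[r + 1][c] + frame[r + 1][c + 1]) / 4.0
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


def active_levels(t: int, periods: Sequence[int]) -> List[int]:
    """Temporal downsampling: level i updates only when t is a multiple
    of periods[i] (an assumed scheme, not taken from the paper)."""
    return [level for level, period in enumerate(periods) if t % period == 0]


def schedule(num_steps: int, periods: Sequence[int]) -> List[List[int]]:
    """List the levels that update at each time step."""
    return [active_levels(t, periods) for t in range(num_steps)]


if __name__ == "__main__":
    # Hypothetical periods: level 0 every step, level 1 every 4th, level 2 every 8th.
    for t, levels in enumerate(schedule(9, (1, 4, 8))):
        print(t, levels)
```

Under these assumptions, the lowest level runs at every step on full-resolution input, while higher levels run on progressively smaller inputs and at lower rates, which is one way long-horizon predictions can be made cheaper than frame-by-frame forecasting.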

updated: Thu Mar 17 2022 13:08:28 GMT+0000 (UTC)

published: Thu Mar 17 2022 13:08:28 GMT+0000 (UTC)

arXiv
