MIMO Is All You Need: A Strong Multi-In-Multi-Out Baseline for Video Prediction

Shuliang Ning; Mengcheng Lan; Yanran Li; Chaofeng Chen; Qian Chen; Xunlai Chen; Xiaoguang Han; Shuguang Cui

Most existing approaches to video prediction build their models on a Single-In-Single-Out (SISO) architecture, which takes the current frame as input and predicts the next frame recursively. Performance often degrades severely when such models extrapolate further into the future, which limits their practical use. Alternatively, a Multi-In-Multi-Out (MIMO) architecture, which outputs all future frames in one shot, naturally breaks the recursion and therefore prevents error accumulation. However, only a few MIMO models for video prediction have been proposed, and, being dated, they achieve only inferior performance. The real strength of the MIMO model in this area has gone largely unnoticed and under-explored. Motivated by this, we conduct a comprehensive investigation to explore how far a simple MIMO architecture can go. Surprisingly, our empirical studies reveal that a simple MIMO model can outperform state-of-the-art work by a margin much larger than expected, especially in dealing with long-term error accumulation. After exploring a number of designs, we propose a new MIMO architecture, MIMO-VP, which extends the pure Transformer with local spatio-temporal blocks and a new multi-output decoder, to establish a new standard in video prediction. We evaluate our model on four highly competitive benchmarks (Moving MNIST, Human3.6M, Weather, KITTI). Extensive experiments show that our model ranks first on all benchmarks with remarkable performance gains and surpasses the best SISO model in all aspects, including efficiency and quantitative and qualitative results. We believe our model can serve as a new baseline to facilitate future research on video prediction. The code will be released.
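The difference between the two calling patterns described in the abstract can be sketched with a toy example. This is not the MIMO-VP model: frames are reduced to scalars, the ground truth is assumed to be "previous frame plus 1", and both toy predictors are given the same hypothetical constant error (`BIAS`) per call. All names here (`siso_rollout`, `mimo_predict`, `toy_siso_step`, `toy_mimo_model`) are illustrative assumptions. The sketch only shows why feeding predictions back can compound an error, while a one-shot multi-output call does not feed anything back.

```python
from typing import Callable, List

Frame = float  # stand-in for an image tensor; a scalar keeps the sketch tiny


def siso_rollout(step: Callable[[Frame], Frame], last_frame: Frame, horizon: int) -> List[Frame]:
    """Predict `horizon` frames by feeding each prediction back as the next input."""
    preds = []
    current = last_frame
    for _ in range(horizon):
        current = step(current)  # after the first step, the input is a prediction
        preds.append(current)
    return preds


def mimo_predict(model: Callable[[List[Frame]], List[Frame]], context: List[Frame]) -> List[Frame]:
    """Predict all future frames in a single call; no prediction is fed back."""
    return model(context)


BIAS = 0.25   # assumed per-call error, chosen only for illustration
HORIZON = 4   # number of future frames to predict


def toy_siso_step(frame: Frame) -> Frame:
    # Toy one-step predictor: true dynamics (+1) plus the assumed bias.
    return frame + 1.0 + BIAS


def toy_mimo_model(context: List[Frame]) -> List[Frame]:
    # Toy multi-output predictor: every horizon is derived from the real
    # context, so each output carries the assumed bias exactly once.
    last = context[-1]
    return [last + k + BIAS for k in range(1, HORIZON + 1)]


context = [0.0, 1.0, 2.0]
truth = [context[-1] + k for k in range(1, HORIZON + 1)]

siso = siso_rollout(toy_siso_step, context[-1], HORIZON)
mimo = mimo_predict(toy_mimo_model, context)

siso_err = [abs(p - t) for p, t in zip(siso, truth)]
mimo_err = [abs(p - t) for p, t in zip(mimo, truth)]

print("SISO error per step:", siso_err)
print("MIMO error per step:", mimo_err)
```

Under these assumptions the recursive rollout's error grows with each step, while the one-shot predictor's error stays flat. A real model's per-horizon error would of course not be constant; the point is only the structural difference between the two prediction loops.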

updated: Tue May 30 2023 04:55:09 GMT+0000 (UTC)

published: Fri Dec 09 2022 03:57:13 GMT+0000 (UTC)

arXiv
