MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation

Vikram Voleti; Alexia Jolicoeur-Martineau; Christopher Pal

Video prediction is a challenging task. The quality of video frames from current state-of-the-art (SOTA) generative models tends to be poor, and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks typically cannot handle other video-related tasks, such as unconditional generation or interpolation, at the same time. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks, using a probabilistic conditional score-based denoising diffusion model conditioned on past and/or future frames. We train the model by randomly and independently masking all the past frames or all the future frames. This novel but straightforward setup allows us to train a single model that can execute a broad range of video tasks, specifically: future/past prediction -- when only future/past frames are masked; unconditional generation -- when both past and future frames are masked; and interpolation -- when neither past nor future frames are masked. Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our MCVD models are built from simple non-recurrent 2D-convolutional architectures that condition on blocks of frames and generate blocks of frames. We generate videos of arbitrary length autoregressively in a block-wise manner. Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with model training taking 1-12 days on ≤ 4 GPUs. Project page: https://mask-cond-video-diffusion.github.io ; Code: https://github.com/voletiv/mcvd-pytorch
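The masking scheme and the block-wise rollout described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only shows how two independent mask decisions map to the four tasks, and how a block sampler can be chained to produce longer videos. The names `mask_conditioning`, `generate_video`, and `denoise_block` are hypothetical, and the masking probability of 0.5 and the choice to mask by zeroing frames are assumptions made for illustration.

```python
import random


def mask_conditioning(past, future, p_mask=0.5, rng=random):
    """Independently mask ALL past frames and/or ALL future frames.

    `past` and `future` are lists of frames; each frame is a flat list of
    floats. Masking is modeled here as replacing a block with zeros
    (an assumption for this sketch). Returns the conditioning blocks and
    the task implied by the mask pattern, as listed in the abstract.
    """
    mask_past = rng.random() < p_mask      # assumed probability
    mask_future = rng.random() < p_mask    # drawn independently

    def zeros_like(block):
        return [[0.0] * len(frame) for frame in block]

    past_c = zeros_like(past) if mask_past else past
    future_c = zeros_like(future) if mask_future else future

    if mask_past and mask_future:
        task = "unconditional generation"
    elif mask_future:
        task = "future prediction"   # only future frames masked
    elif mask_past:
        task = "past prediction"     # only past frames masked
    else:
        task = "interpolation"       # nothing masked
    return past_c, future_c, task


def generate_video(denoise_block, context, num_frames, block_size):
    """Block-wise autoregressive rollout.

    `denoise_block(cond_frames, block_size)` is a hypothetical stand-in
    for a conditional diffusion sampler that returns `block_size` new
    frames. Each new block is conditioned on the most recent frames.
    """
    frames = list(context)
    n_ctx = len(context)
    while len(frames) - n_ctx < num_frames:
        cond = frames[len(frames) - n_ctx:]   # latest n_ctx frames
        frames.extend(denoise_block(cond, block_size))
    # Drop the given context and trim any surplus from the last block.
    return frames[n_ctx:n_ctx + num_frames]
```

In a training loop, `mask_conditioning` would be called once per sample so that a single model sees all four conditioning patterns; at test time the mask pattern is fixed by the task, and `generate_video` extends the output beyond one block.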

updated: Wed Oct 12 2022 19:33:40 GMT+0000 (UTC)

published: Thu May 19 2022 20:58:05 GMT+0000 (UTC)

arXiv
