Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths

Yingqing He; Tianyu Yang; Yong Zhang; Ying Shan; Qifeng Chen

AI-generated content has attracted considerable attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models (DMs) are another class of deep generative models and have recently achieved remarkable performance on various image synthesis tasks. However, training image diffusion models usually requires substantial computational resources to achieve high performance, which makes extending diffusion models to high-dimensional video synthesis even more computationally expensive. To ease this problem while leveraging their advantages, we introduce lightweight video diffusion models that synthesize high-fidelity, arbitrarily long videos from pure noise. Specifically, we propose to perform diffusion and denoising in a low-dimensional 3D latent space, which significantly outperforms previous methods that operate in 3D pixel space under a limited computational budget. In addition, though trained on only tens of frames, our models can generate videos of arbitrary length, e.g., thousands of frames, in an autoregressive way. Finally, conditional latent perturbation is introduced to reduce performance degradation when synthesizing long-duration videos. Extensive experiments on various datasets and generated lengths suggest that our framework is able to sample much more realistic and longer videos than previous approaches, including GAN-based, autoregressive-based, and diffusion-based methods.
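The autoregressive extension described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so every name here (`toy_sampler`, `perturb`, `generate_long_video`), the clip and context lengths, and the noise scale are hypothetical. In particular, `toy_sampler` is only a stand-in for a trained latent diffusion sampler, and the Gaussian perturbation of the conditioning frames is one assumed reading of "conditional latent perturbation".

```python
import random


def perturb(frames, noise_scale, rng):
    """Add small Gaussian noise to each conditioning latent frame (assumed form)."""
    return [[x + noise_scale * rng.gauss(0.0, 1.0) for x in frame] for frame in frames]


def toy_sampler(context, num_new, latent_dim, rng):
    """Hypothetical stand-in for a trained latent video diffusion sampler.

    Starts from the last context frame (or pure noise if there is no context)
    and returns `num_new` latent frames produced by a simple noisy recurrence.
    """
    prev = context[-1] if context else [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    out = []
    for _ in range(num_new):
        prev = [0.9 * x + 0.1 * rng.gauss(0.0, 1.0) for x in prev]
        out.append(prev)
    return out


def generate_long_video(total_frames, clip_len=16, context_len=4,
                        latent_dim=8, noise_scale=0.05, seed=0):
    """Roll out `total_frames` latent frames clip by clip.

    The first clip is sampled without context; each later clip is conditioned
    on the last `context_len` frames, which are perturbed before use.
    """
    if clip_len <= context_len:
        raise ValueError("clip_len must exceed context_len")
    rng = random.Random(seed)
    video = toy_sampler([], clip_len, latent_dim, rng)
    while len(video) < total_frames:
        context = perturb(video[-context_len:], noise_scale, rng)
        num_new = min(clip_len - context_len, total_frames - len(video))
        video.extend(toy_sampler(context, num_new, latent_dim, rng))
    return video[:total_frames]


if __name__ == "__main__":
    latents = generate_long_video(total_frames=1000)
    print(len(latents), len(latents[0]))
```

In a real system the latent frames would be decoded back to pixels by a separate decoder; that step is omitted here because the abstract does not describe it.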

updated: Wed Nov 23 2022 18:58:39 GMT+0000 (UTC)

published: Wed Nov 23 2022 18:58:39 GMT+0000 (UTC)

arXiv
