MagicVideo: Efficient Video Generation With Latent Diffusion Models

Daquan Zhou; Weimin Wang; Hanshu Yan; Weiwei Lv; Yizhe Zhu; Jiashi Feng

We present MagicVideo, an efficient text-to-video generation framework based on latent diffusion models. MagicVideo can generate smooth video clips that are consistent with the given text descriptions. Thanks to a novel and efficient 3D U-Net design and to modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card, requiring around 64x fewer FLOPs than Video Diffusion Models (VDM). Specifically, unlike existing works that train video models directly in the RGB space, we use a pre-trained VAE to map video clips into a low-dimensional latent space and learn the distribution of the videos' latent codes via a diffusion model. In addition, we introduce two new designs to adapt a U-Net denoiser trained on image tasks to video data: a lightweight frame-wise adaptor for image-to-video distribution adjustment, and a directed temporal attention module that captures temporal dependencies across frames. As a result, we can reuse the informative convolution weights of a text-to-image model to accelerate video training. To reduce pixel dithering in the generated videos, we also propose a novel VideoVAE auto-encoder for better RGB reconstruction. We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content. See https://magicvideo.github.io/# for more examples.
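
The abstract does not spell out how the directed temporal attention module works. As a rough illustration only, the sketch below assumes that "directed" means each frame attends to itself and to earlier frames, never to later ones. The function name `directed_temporal_attention` is hypothetical, and the learned query/key/value projections of a real attention layer are omitted (identity projections are used), so this is not the authors' implementation.

```python
import math


def directed_temporal_attention(frames):
    """Toy directed attention over the time axis at one spatial location.

    frames: list of T feature vectors (one per frame), each of length d.
    Assumption: frame t may only attend to frames 0..t.
    Simplification: queries, keys and values are the raw features
    (no learned projections).
    """
    num_frames = len(frames)
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for t in range(num_frames):
        # Scaled dot-product scores against frames 0..t only; leaving out
        # later frames plays the role of the directed (causal) mask.
        scores = [
            scale * sum(q * k for q, k in zip(frames[t], frames[s]))
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        weights = [math.exp(x - peak) for x in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output is the attention-weighted average of the visible frames.
        outputs.append(
            [sum(w * frames[s][i] for s, w in enumerate(weights)) for i in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    clip = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for t, vec in enumerate(directed_temporal_attention(clip)):
        print(t, vec)
```

Under this assumption, changing a later frame leaves the outputs for all earlier frames unchanged, which is the property that makes the attention directed.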

updated: Thu May 11 2023 11:23:03 GMT+0000 (UTC)

published: Sun Nov 20 2022 16:40:31 GMT+0000 (UTC)

arXiv
