MagicVideo: Efficient Video Generation With Latent Diffusion Models

Daquan Zhou; Weimin Wang; Hanshu Yan; Weiwei Lv; Yizhe Zhu; Jiashi Feng

We present MagicVideo, an efficient text-to-video generation framework based on latent diffusion models. Given a text description, MagicVideo can generate photo-realistic video clips with high relevance to the text content. With the proposed efficient latent 3D U-Net design, MagicVideo can generate video clips with 256x256 spatial resolution on a single GPU card, which is 64x faster than the recent video diffusion model (VDM). Unlike previous works that train video generation from scratch in the RGB space, we propose to generate video clips in a low-dimensional latent space. We further reuse all the convolution operator weights of pre-trained text-to-image generative U-Net models for faster training. To achieve this, we introduce two new designs to adapt the U-Net decoder to video data: a frame-wise lightweight adaptor for the image-to-video distribution adjustment and a directed temporal attention module to capture temporal dependencies between frames. The whole generation process takes place within the low-dimensional latent space of a pre-trained variational auto-encoder. We demonstrate that MagicVideo can generate both realistic video content and imaginary content in a photo-realistic style, with a trade-off between quality and computational cost. Refer to https://magicvideo.github.io/# for more examples.
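The abstract does not spell out how the directed temporal attention module works. As a rough illustration only, the sketch below assumes that "directed" means each frame attends to itself and to earlier frames, implemented with a lower-triangular (causal) mask over the frame axis. The function name, the projection matrices `w_q`, `w_k`, `w_v`, and the tensor shapes are all hypothetical and are not taken from the paper.

```python
import numpy as np


def directed_temporal_attention(x, w_q, w_k, w_v):
    """Single-head attention over the frame axis with a causal mask.

    x: array of shape (frames, dim), one latent feature vector per frame.
    w_q, w_k, w_v: (dim, dim) projection matrices (hypothetical parameters).
    Returns the attended features and the attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])

    # Assumption: "directed" = frame i may only look at frames j <= i.
    n = x.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over the key (frame) axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy usage: 4 frames, 8-dimensional features, random projections.
rng = np.random.default_rng(0)
frames, dim = 4, 8
x = rng.normal(size=(frames, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
out, weights = directed_temporal_attention(x, w_q, w_k, w_v)
```

Under this assumption, changing a later frame leaves the outputs for all earlier frames unchanged, which is one way to read the "directed" dependency between frames.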

updated: Sun Nov 20 2022 16:40:31 GMT+0000 (UTC)

published: Sun Nov 20 2022 16:40:31 GMT+0000 (UTC)

arXiv
