ControlVideo: Training-free Controllable Text-to-Video Generation

Yabo Zhang; Yuxiang Wei; Dongsheng Jiang; Xiaopeng Zhang; Wangmeng Zuo; Qi Tian

Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterparts still lag behind because of the excessive training cost of temporal modeling. Beyond the training burden, generated videos also suffer from appearance inconsistency and structural flicker, especially in long video synthesis. To address these challenges, we design a training-free framework called ControlVideo to enable natural and efficient text-to-video generation. ControlVideo, adapted from ControlNet, leverages the coarse structural consistency of input motion sequences and introduces three modules to improve video generation. First, to ensure appearance coherence between frames, ControlVideo adds full cross-frame interaction to the self-attention modules. Second, to mitigate flicker, it introduces an interleaved-frame smoother that applies frame interpolation to alternate frames. Finally, to produce long videos efficiently, it uses a hierarchical sampler that synthesizes each short clip separately while preserving holistic coherence. Empowered by these modules, ControlVideo outperforms the state of the art on extensive motion-prompt pairs, both quantitatively and qualitatively. Notably, thanks to its efficient design, it generates both short and long videos within several minutes on a single NVIDIA 2080Ti. Code is available at https://github.com/YBYBZhang/ControlVideo.
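As a rough illustration of two of the ideas named in the abstract, the toy sketch below shows (a) smoothing alternate frames by interpolating between their neighbours and (b) splitting a long frame sequence into short clips that share boundary frames. It is not the authors' implementation. The function names, the use of a per-pixel average in place of a real frame-interpolation model, and the key-frame splitting scheme are all assumptions made for illustration.

```python
from typing import List

# Toy stand-in: a frame is a flat list of pixel values (assumption, not the paper's format).
Frame = List[float]


def interpolate(prev_frame: Frame, next_frame: Frame) -> Frame:
    """Hypothetical interpolator: per-pixel average of the two neighbouring frames.
    A real system would use a learned frame-interpolation model instead."""
    return [(a + b) / 2.0 for a, b in zip(prev_frame, next_frame)]


def smooth_interleaved(frames: List[Frame], start: int) -> List[Frame]:
    """Replace every second interior frame, beginning at index `start` (1 or 2),
    with the interpolation of its two original neighbours.
    Alternating `start` between calls covers odd and even frames in turn."""
    if start not in (1, 2):
        raise ValueError("start must be 1 or 2")
    smoothed = [list(f) for f in frames]
    for i in range(start, len(frames) - 1, 2):
        smoothed[i] = interpolate(frames[i - 1], frames[i + 1])
    return smoothed


def split_into_clips(num_frames: int, clip_len: int) -> List[List[int]]:
    """Assumed splitting scheme: pick key frames every (clip_len - 1) indices,
    then let each clip span two consecutive key frames (inclusive), so that
    neighbouring clips share one key frame."""
    if clip_len < 2 or num_frames < 1:
        raise ValueError("need clip_len >= 2 and num_frames >= 1")
    keys = list(range(0, num_frames, clip_len - 1))
    if keys[-1] != num_frames - 1:
        keys.append(num_frames - 1)  # make sure the last frame is a key frame
    return [list(range(a, b + 1)) for a, b in zip(keys, keys[1:])]
```

For example, `split_into_clips(9, 5)` yields two five-frame clips that share frame 4. In a sketch like this the shared key frames are what would tie separately generated clips together, whereas the paper's hierarchical sampler may differ in its details.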

updated: Mon May 22 2023 14:48:53 GMT+0000 (UTC)

published: Mon May 22 2023 14:48:53 GMT+0000 (UTC)

arXiv
