Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Jay Zhangjie Wu; Yixiao Ge; Xintao Wang; Weixian Lei; Yuchao Gu; Wynne Hsu; Ying Shan; Xiaohu Qie; Mike Zheng Shou

To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ large-scale text-video datasets for fine-tuning. However, such a paradigm is computationally expensive. Humans have the amazing ability to learn new visual concepts from a single exemplar. We hereby study a new T2V generation problem, One-Shot Video Generation, in which only a single text-video pair is presented for training an open-domain T2V generator. Intuitively, we propose to adapt a T2I diffusion model pretrained on massive image data for T2V generation. We make two key observations: 1) T2I models are able to generate images that align well with verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal Attention, which generates videos from text prompts via an efficient one-shot tuning of pretrained T2I diffusion models. Tune-A-Video is capable of producing temporally coherent videos across various applications, such as change of subject or background, attribute editing, and style transfer, demonstrating the versatility and effectiveness of our method.

updated: Thu Dec 22 2022 09:43:36 GMT+0000 (UTC)

published: Thu Dec 22 2022 09:43:36 GMT+0000 (UTC)

arXiv
