Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising

Fu-Yun Wang; Wenshuo Chen; Guanglu Song; Han-Jia Ye; Yu Liu; Hongsheng Li

Leveraging large-scale image-text datasets and advances in diffusion models, text-driven generative models have made remarkable strides in image generation and editing. This study explores the potential of extending this text-driven capability to the generation and editing of multi-text conditioned long videos. Current methods for video generation and editing, while innovative, are often confined to extremely short videos (typically fewer than 24 frames) and are limited to a single text condition. These constraints significantly limit their applications, given that real-world videos usually consist of multiple segments, each bearing different semantic information. To address this challenge, we introduce a novel paradigm dubbed Gen-L-Video, capable of extending off-the-shelf short video diffusion models to generate and edit videos comprising hundreds of frames with diverse semantic segments, without introducing additional training and while preserving content consistency. We implemented three mainstream text-driven video generation and editing methods and used the proposed paradigm to extend them to longer videos containing a variety of semantic segments. Our experimental results show that our approach significantly broadens the generative and editing capabilities of video diffusion models, offering new possibilities for future research and applications. The code is available at https://github.com/G-U-N/Gen-L-Video.

updated: Mon May 29 2023 17:38:18 GMT+0000 (UTC)

published: Mon May 29 2023 17:38:18 GMT+0000 (UTC)

arXiv
