DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

Shengming Yin; Chenfei Wu; Jian Liang; Jie Shi; Houqiang Li; Gong Ming; Nan Duan

Controllable video generation has gained significant attention in recent years. However, two main limitations persist. First, most existing work focuses on only one of text-, image-, or trajectory-based control, which prevents fine-grained control over videos. Second, trajectory control research is still in its early stages, with most experiments conducted on simple datasets such as Human3.6M. This constraint limits the models' capability to process open-domain images and to handle complex curved trajectories effectively. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To tackle the insufficient control granularity of existing work, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, we propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories at different granularities, and an Adaptive Training (AT) strategy to generate consistent videos that follow the trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control of video generation. The project homepage is https://www.microsoft.com/en-us/research/project/dragnuwa/

updated: Wed Aug 16 2023 01:43:41 GMT+0000 (UTC)

published: Wed Aug 16 2023 01:43:41 GMT+0000 (UTC)

arXiv
