Video-P2P: Video Editing with Cross-attention Control

Shaoteng Liu; Yuechen Zhang; Wenbo Li; Zhe Lin; Jiaya Jia

This paper presents Video-P2P, a novel framework for real-world video editing with cross-attention control. While attention control has proven effective for image editing with pre-trained image generation models, there are currently no large-scale video generation models publicly available. Video-P2P addresses this limitation by adapting an image generation diffusion model to complete various video editing tasks. Specifically, we propose to first tune a Text-to-Set (T2S) model to complete an approximate inversion and then optimize a shared unconditional embedding to achieve accurate video inversion with a small memory cost. For attention control, we introduce a novel decoupled-guidance strategy, which uses different guidance strategies for the source and target prompts. The optimized unconditional embedding for the source prompt improves reconstruction ability, while an initialized unconditional embedding for the target prompt enhances editability. Incorporating the attention maps of these two branches enables detailed editing. These technical designs enable various text-driven editing applications, including word swap, prompt refinement, and attention re-weighting. Video-P2P works well on real-world videos for generating new characters while optimally preserving their original poses and scenes. It significantly outperforms previous approaches.
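The decoupled-guidance idea described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the standard classifier-free guidance formula, uses a hypothetical `denoise(latent, embedding)` callable in place of the tuned T2S model, and picks a guidance scale of 7.5 purely as an illustrative default. The step that combines the two branches' attention maps is omitted.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# Hypothetical denoiser signature (an assumption, not the paper's API):
# (latent, text_or_unconditional_embedding) -> predicted noise.
Denoiser = Callable[[Sequence[float], Sequence[float]], Vector]


def cfg(eps_uncond: Sequence[float], eps_cond: Sequence[float], scale: float) -> Vector:
    """Standard classifier-free guidance: eps_u + scale * (eps_c - eps_u)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def decoupled_guidance_step(
    denoise: Denoiser,
    latent_src: Sequence[float],
    latent_tgt: Sequence[float],
    src_emb: Sequence[float],
    tgt_emb: Sequence[float],
    optimized_uncond: Sequence[float],
    initial_uncond: Sequence[float],
    scale: float = 7.5,  # illustrative default, not taken from the paper
) -> Tuple[Vector, Vector]:
    """Compute guided noise for the source and target branches separately.

    Source branch: uses the optimized unconditional embedding (reconstruction).
    Target branch: uses the initialized unconditional embedding (editability).
    """
    eps_src = cfg(denoise(latent_src, optimized_uncond), denoise(latent_src, src_emb), scale)
    eps_tgt = cfg(denoise(latent_tgt, initial_uncond), denoise(latent_tgt, tgt_emb), scale)
    return eps_src, eps_tgt


def toy_denoiser(latent: Sequence[float], emb: Sequence[float]) -> Vector:
    """Stand-in for a real diffusion model, for demonstration only."""
    return [x + e for x, e in zip(latent, emb)]
```

The only difference between the two branches in this sketch is which unconditional embedding feeds the guidance formula, which is the distinction the abstract draws between reconstruction ability and editability.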

updated: Wed Mar 08 2023 17:53:49 GMT+0000 (UTC)

published: Wed Mar 08 2023 17:53:49 GMT+0000 (UTC)

arXiv
