InFusion: Inject and Attention Fusion for Multi Concept Zero Shot Text based Video Editing

Anant Khandelwal

Large text-to-image diffusion models have achieved remarkable success in generating diverse, high-quality images aligned with the text prompt used to edit an input image. However, when these models are applied to video, the main challenge is ensuring temporal consistency and coherence across frames. In this paper, we propose InFusion, a framework for zero-shot text-based video editing that leverages large pre-trained image diffusion models. Our framework supports editing of multiple concepts, with pixel-level control over the diverse concepts mentioned in the editing prompt. Specifically, we inject the difference between the features obtained with the source and edit prompts from the U-Net residual blocks in the decoder layers. Combined with injected attention features, this makes it feasible to query the source contents and scale the edited concepts while injecting the unedited parts. The editing is further controlled in a fine-grained manner by a mask extraction and attention fusion strategy, which cuts the edited part from the source and pastes it into the denoising pipeline for the editing prompt. Our framework is a low-cost alternative to one-shot tuned models for editing, since it requires no training. We demonstrate complex concept editing with a generalised image model (Stable Diffusion v1.5) using LoRA. The adaptation is compatible with all existing image diffusion techniques. Extensive experimental results demonstrate its effectiveness over existing methods in rendering high-quality, temporally consistent videos.
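
The cut-and-paste idea behind mask-based fusion can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `extract_mask` and `fuse`, the thresholding rule, and the flattened per-pixel lists are all assumptions made for illustration. It only shows the general pattern of deriving a binary mask from an attention map and using it to combine edited and source features.

```python
from typing import List


def extract_mask(attn: List[float], threshold: float = 0.5) -> List[float]:
    """Binarize a flattened attention map (assumed normalized to [0, 1]).

    Hypothetical helper: the thresholding rule is an assumption, not the
    paper's mask extraction procedure.
    """
    return [1.0 if a >= threshold else 0.0 for a in attn]


def fuse(source: List[float], edited: List[float], mask: List[float]) -> List[float]:
    """Keep edited values where the mask is 1 and source values elsewhere."""
    if not (len(source) == len(edited) == len(mask)):
        raise ValueError("source, edited and mask must have the same length")
    return [m * e + (1.0 - m) * s for s, e, m in zip(source, edited, mask)]


# Toy example: four "pixels" with scalar features.
attn = [0.1, 0.8, 0.6, 0.2]       # attention to the edited concept
source = [1.0, 1.0, 1.0, 1.0]     # features from the source branch
edited = [5.0, 5.0, 5.0, 5.0]     # features from the edit branch

mask = extract_mask(attn)
fused = fuse(source, edited, mask)
```

In this toy example only the two pixels with high attention take the edited values, while the rest keep the source values, which is the sense in which unedited parts are preserved.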

updated: Wed Aug 02 2023 16:11:47 GMT+0000 (UTC)

published: Sat Jul 22 2023 17:05:47 GMT+0000 (UTC)

arXiv
