Real-World Image Variation by Aligning Diffusion Inversion Chain

Yuechen Zhang; Jinbo Xing; Eric Lo; Jiaya Jia

Recent advances in diffusion models have made it possible to generate high-fidelity images from text prompts. However, a domain gap exists between generated images and real-world images, which makes it challenging to generate high-quality variations of real-world images. Our investigation shows that this domain gap originates from a gap between the latent distributions of different diffusion processes. To address this issue, we propose a novel inference pipeline called Real-world Image Variation by ALignment (RIVAL), which uses diffusion models to generate image variations from a single image exemplar. Our pipeline improves the quality of the generated variations by aligning the image generation process with the source image's inversion chain. Specifically, we demonstrate that step-wise latent distribution alignment is essential for generating high-quality variations. To achieve this, we design a cross-image self-attention injection for feature interaction and a step-wise distribution normalization to align the latent features. Incorporating these alignment processes into a diffusion model allows RIVAL to generate high-quality image variations without further parameter optimization. Our experimental results show that the proposed approach outperforms existing methods in semantic-condition similarity and perceptual quality. Furthermore, this generalized inference pipeline can be easily applied to other diffusion-based generation tasks, such as image-conditioned text-to-image generation and example-based image inpainting.
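The two alignment components named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that "step-wise distribution normalization" can be approximated by matching the mean and standard deviation of the generated latent to the source latent at the same step, and that "cross-image self-attention injection" can be approximated by letting generated-image queries attend over source and generated tokens together. The function names `align_latent` and `cross_image_attention`, the flat-list latent representation, and the omission of learned projections are all assumptions made for illustration.

```python
import math
from statistics import fmean, pstdev


def align_latent(generated, source, eps=1e-8):
    """Shift and scale `generated` so its mean and std match `source`.

    Both arguments are flat lists of floats standing in for one latent
    channel at a single denoising step (an assumption of this sketch).
    """
    g_mean, g_std = fmean(generated), pstdev(generated)
    s_mean, s_std = fmean(source), pstdev(source)
    scale = s_std / (g_std + eps)  # eps guards against a constant latent
    return [(x - g_mean) * scale + s_mean for x in generated]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_image_attention(queries, gen_tokens, src_tokens):
    """Attend from generated-image queries over source and generated tokens.

    Each token is a list of floats. Keys double as values here, so no
    learned projections are modeled (a simplification of this sketch).
    """
    tokens = src_tokens + gen_tokens  # concatenate along the token axis
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]
        )
    return outputs


# Hypothetical inversion chain: one source latent per denoising step.
source_chain = [[0.0, 2.0, 4.0], [1.0, 2.0, 3.0]]
latent = [10.0, 20.0, 30.0]
for source_latent in source_chain:
    latent = align_latent(latent, source_latent)  # re-align at every step
```

In a real pipeline the denoising update would run between alignment calls and the latents would be multi-channel tensors; the loop above only shows where a per-step alignment would sit relative to the source inversion chain.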

updated: Sat Jul 15 2023 08:09:02 GMT+0000 (UTC)

published: Tue May 30 2023 04:09:47 GMT+0000 (UTC)

arXiv
