Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

Narek Tumanyan; Michal Geyer; Shai Bagon; Tali Dekel

Large-scale text-to-image generative models have been a revolutionary breakthrough in the evolution of generative AI, allowing us to synthesize diverse images that convey highly complex visual concepts. However, a pivotal challenge in leveraging such models for real-world content creation tasks is providing users with control over the generated content. In this paper, we present a new framework that takes text-to-image synthesis to the realm of image-to-image translation: given a guidance image and a target text prompt, our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text while preserving the semantic layout of the source image. Specifically, we observe and empirically demonstrate that fine-grained control over the generated structure can be achieved by manipulating spatial features and their self-attention inside the model. This results in a simple and effective approach in which features extracted from the guidance image are directly injected into the generation process of the target image. The approach requires no training or fine-tuning and applies to both real and generated guidance images. We demonstrate high-quality results on versatile text-guided image translation tasks, including translating sketches, rough drawings and animations into realistic images, changing the class and appearance of objects in a given image, and modifying global qualities such as lighting and color.
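The idea of injecting self-attention from the guidance image into the target generation can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the single attention layer, the random projection weights (`w_q`, `w_k`, `w_v`), the feature shapes, and the helper names `attention_map` and `self_attention` are all assumptions made for illustration, and no real diffusion model is involved. The sketch only shows the mechanism the abstract describes, where the attention map computed on guidance features replaces the one the target pass would compute for itself.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(feats, w_q, w_k):
    # Row-stochastic self-attention matrix over spatial tokens.
    q = feats @ w_q
    k = feats @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def self_attention(feats, w_q, w_k, w_v, injected_map=None):
    # If an attention map is injected, use it instead of the layer's own map.
    if injected_map is None:
        attn = attention_map(feats, w_q, w_k)
    else:
        attn = injected_map
    return attn @ (feats @ w_v)

# Hypothetical toy setup: 16 spatial tokens with 8 channels each.
rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
guidance_feats = rng.standard_normal((n_tokens, dim))  # stand-in for guidance-image features
target_feats = rng.standard_normal((n_tokens, dim))    # stand-in for target-generation features

# Attention computed on the guidance pass, then reused in the target pass.
guidance_map = attention_map(guidance_feats, w_q, w_k)
plain = self_attention(target_feats, w_q, w_k, w_v)
injected = self_attention(target_feats, w_q, w_k, w_v, injected_map=guidance_map)
```

In this sketch the target pass keeps its own values (`target_feats @ w_v`) but aggregates them according to the spatial affinities of the guidance features, which is one way to read "preserving the semantic layout of the source image" at the level of a single layer.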

updated: Tue Nov 22 2022 20:39:18 GMT+0000 (UTC)

published: Tue Nov 22 2022 20:39:18 GMT+0000 (UTC)

arXiv
