UniTune: Text-Driven Image Editing by Fine Tuning a Diffusion Model on a Single Image

Dani Valevski; Matan Kalman; Eyal Molad; Eyal Segalis; Yossi Matias; Yaniv Leviathan

Text-driven image generation methods have recently shown impressive results, allowing casual users to generate high-quality images by providing textual descriptions. However, similar capabilities for editing existing images are still out of reach. Text-driven image editing methods usually need edit masks, struggle with edits that require significant visual changes, and cannot easily keep specific details of the edited portion. In this paper, we observe that image-generation models can be converted to image-editing models simply by fine-tuning them on a single image. We also show that initializing the stochastic sampler with a noised version of the base image before sampling, and interpolating relevant details from the base image after sampling, further increase the quality of the edit operation. Combining these observations, we propose UniTune, a novel image editing method. UniTune takes as input an arbitrary image and a textual edit description, and carries out the edit while maintaining high fidelity to the input image. UniTune does not require additional inputs, such as masks or sketches, and can perform multiple edits on the same image without retraining. We test our method using the Imagen model in a range of different use cases. We demonstrate that it is broadly applicable and can perform a surprisingly wide range of expressive editing operations, including those requiring significant visual changes that were previously impossible.
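
The two sampling-time ideas in the abstract (starting the sampler from a noised copy of the base image, then blending base-image details back in afterwards) can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not specify the noise schedule or the form of the interpolation, so the sketch assumes a standard DDPM-style forward step, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, and a plain pixel-wise linear blend. The function names, the `alpha_bar` and `weight` parameters, and the flat list-of-floats "image" are all hypothetical, and the fine-tuned model's denoising loop is replaced by a placeholder.

```python
import math
import random


def noise_base_image(x0, alpha_bar, rng):
    """Assumed DDPM-style forward noising of the base image.

    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, 1).
    alpha_bar = 1 returns the base image unchanged; alpha_bar = 0 is pure noise.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must be in [0, 1]")
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * p + noise_scale * rng.gauss(0.0, 1.0) for p in x0]


def interpolate_with_base(edited, base, weight):
    """Assumed pixel-wise linear blend of the edited and base images.

    weight = 1 keeps the base image, weight = 0 keeps the edited image.
    """
    if len(edited) != len(base):
        raise ValueError("images must have the same number of pixels")
    return [weight * b + (1.0 - weight) * e for e, b in zip(edited, base)]


rng = random.Random(0)
base = [0.2, 0.5, 0.8]  # toy "image": three pixel intensities

# Step 1: initialize the sampler from a noised version of the base image.
x_t = noise_base_image(base, alpha_bar=0.5, rng=rng)

# Step 2 (placeholder): the fine-tuned model would denoise x_t under the
# edit prompt. A fixed result stands in for that output here.
edited = [0.0, 1.0, 0.0]

# Step 3: blend details from the base image back into the edited result.
blended = interpolate_with_base(edited, base, weight=0.25)
```

In this sketch, a larger `alpha_bar` keeps the starting point closer to the base image, and a larger `weight` pulls the final result back toward it. Both knobs trade fidelity to the input against freedom to make the edit.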

updated: Wed Jul 05 2023 12:35:29 GMT+0000 (UTC)

published: Mon Oct 17 2022 23:46:05 GMT+0000 (UTC)

arXiv
