Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models

Qiucheng Wu; Yujian Liu; Handong Zhao; Ajinkya Kale; Trung Bui; Tong Yu; Zhe Lin; Yang Zhang; Shiyu Chang

Generative models have been widely studied in computer vision. Recently, diffusion models have drawn substantial attention due to the high quality of their generated images. A key desired property of image generative models is the ability to disentangle different attributes: it should be possible to modify an image towards a style without changing its semantic content, and the modification parameters should generalize to different images. Previous studies have found that generative adversarial networks (GANs) are inherently endowed with such disentanglement capability, so they can perform disentangled image editing without re-training or fine-tuning the network. In this work, we explore whether diffusion models are also inherently equipped with such a capability. Our finding is that, for stable diffusion models, partially changing the input text embedding from a neutral description (e.g., "a photo of person") to one with style (e.g., "a photo of person with smile"), while fixing all the Gaussian random noises introduced during the denoising process, modifies the generated images towards the target style without changing the semantic content. Based on this finding, we further propose a simple, lightweight image editing algorithm in which the mixing weights of the two text embeddings are optimized for style matching and content preservation. The entire process involves optimizing only around 50 parameters and does not fine-tune the diffusion model itself. Experiments show that the proposed method can modify a wide range of attributes and outperforms diffusion-model-based image-editing algorithms that require fine-tuning. The optimized weights generalize well to different images. Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffusionDisentanglement.
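To make the core idea concrete, the following is a minimal sketch of the two ingredients the abstract describes: blending a neutral and a styled text embedding with a small set of mixing weights, and reusing the same Gaussian noise across runs. It is not the authors' implementation. The function names, the linear form of the blend, the reading of "around 50 parameters" as one weight per denoising step, and the (77, 768) embedding shape are all assumptions made for illustration, and no actual diffusion model is invoked.

```python
import numpy as np


def mix_text_embeddings(neutral, styled, weights):
    """Blend two text embeddings, producing one embedding per denoising step.

    neutral, styled: arrays of shape (tokens, dim), hypothetical encoder outputs.
    weights: array of shape (steps,), each value in [0, 1]. A weight of 0 keeps
             the neutral embedding, 1 switches fully to the styled embedding.
             (Assumption: one scalar weight per denoising step.)
    Returns an array of shape (steps, tokens, dim).
    """
    neutral = np.asarray(neutral, dtype=np.float64)
    styled = np.asarray(styled, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    if neutral.shape != styled.shape:
        raise ValueError("neutral and styled embeddings must have the same shape")
    if weights.ndim != 1:
        raise ValueError("weights must be a 1-D array with one entry per step")
    if np.any(weights < 0.0) or np.any(weights > 1.0):
        raise ValueError("weights must lie in [0, 1]")
    lam = weights[:, None, None]  # broadcast each weight over (tokens, dim)
    return (1.0 - lam) * neutral[None] + lam * styled[None]


def fixed_noise(seed, steps, shape):
    """Draw the per-step Gaussian noise once, reproducibly, from a seed.

    Reusing the same seed for the neutral and the edited generation keeps the
    noise identical, so only the text conditioning differs between runs.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((steps, *shape))
```

In a full pipeline, the weights would be the only quantities optimized (for style matching and content preservation), while the diffusion model stays frozen; the per-step embeddings returned above would then be supplied as conditioning at each denoising step.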

updated: Fri Dec 16 2022 19:58:52 GMT+0000 (UTC)

published: Fri Dec 16 2022 19:58:52 GMT+0000 (UTC)

arXiv
