Late-Constraint Diffusion Guidance for Controllable Image Synthesis

Chang Liu; Dong Liu

Diffusion models, with or without text conditioning, have demonstrated an impressive capability to synthesize photorealistic images given a few or even no words. These models may not fully satisfy user needs, since ordinary users and artists often want to control the synthesized images with specific guidance, such as overall layout, color, structure, and object shape. To adapt diffusion models for controllable image synthesis, several methods have been proposed that incorporate the required conditions as regularization on the intermediate features of the diffusion denoising network. These methods, which we call early-constraint methods in this paper, have difficulty handling multiple conditions with a single solution. They typically train a separate model for each specific condition, which incurs a high training cost and yields solutions that do not generalize. To address these difficulties, we propose a new approach, namely late-constraint: we leave the diffusion network unchanged but constrain its output to be aligned with the required conditions. Specifically, we train a lightweight condition adapter to establish the correlation between external conditions and the internal representations of diffusion models. During the iterative denoising process, the conditional guidance is fed to the corresponding condition adapter, which steers the sampling process using the established correlation. We further equip the late-constraint strategy with a timestep resampling method and an early stopping technique, which improve the quality of the synthesized images while complying with the guidance. Our method outperforms existing early-constraint methods and generalizes better to unseen conditions. Our code will be made available.
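
The late-constraint idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the denoiser, the adapter, the 8-element "image", the guidance scale, and the step counts are all hypothetical stand-ins, and the adapter here reads the sample directly rather than internal network features. Early stopping is assumed to mean dropping the guidance for the final few denoising steps; timestep resampling is not modeled.

```python
import random

def frozen_denoiser(x, t):
    # Hypothetical stand-in for a pretrained denoiser that is never modified.
    # It predicts "noise" as each element's deviation from the mean,
    # so removing it smooths x but leaves the mean untouched. t is unused here.
    mean = sum(x) / len(x)
    return [0.1 * (xi - mean) for xi in x]

def condition_adapter(x):
    # Hypothetical stand-in for a lightweight trained adapter: it maps the
    # sample to a predicted condition (here, its mean, e.g. "brightness").
    return sum(x) / len(x)

def guidance_grad(x, target):
    # Analytic gradient of (condition_adapter(x) - target)**2 w.r.t. x.
    err = condition_adapter(x) - target
    return [2.0 * err / len(x) for _ in x]

def sample(target, steps=50, scale=2.0, stop_guidance_at=10, seed=0, guided=True):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(8)]  # start from pure noise
    for t in range(steps, 0, -1):
        # 1) Ordinary denoising step with the unchanged network.
        eps = frozen_denoiser(x, t)
        x = [xi - ei for xi, ei in zip(x, eps)]
        # 2) Late constraint: nudge the output toward the target condition.
        #    Guidance is skipped for the last steps (assumed early stopping).
        if guided and t > stop_guidance_at:
            grad = guidance_grad(x, target)
            x = [xi - scale * gi for xi, gi in zip(x, grad)]
    return x

if __name__ == "__main__":
    target = 3.0
    for flag in (True, False):
        out = sample(target, guided=flag)
        print("guided" if flag else "unguided", abs(condition_adapter(out) - target))
```

In this sketch the same frozen denoiser serves both runs; only the guidance term differs, which mirrors the claim that the diffusion network itself is left unchanged. Supporting another condition would, under these assumptions, mean swapping in a different adapter rather than retraining the denoiser.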

updated: Wed Jun 14 2023 12:29:17 GMT+0000 (UTC)

published: Fri May 19 2023 08:40:01 GMT+0000 (UTC)

arXiv
