Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

Weixi Feng; Xuehai He; Tsu-Jui Fu; Varun Jampani; Arjun Akula; Pradyumna Narayana; Sugato Basu; Xin Eric Wang; William Yang Wang

Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we observe that attribute binding and compositional capabilities remain major challenges, especially when multiple objects are involved. In this work, we improve the compositional skills of T2I models, specifically achieving more accurate attribute binding and better image compositions. To do this, we incorporate linguistic structures into the diffusion guidance process, building on the controllability that comes from manipulating cross-attention layers in diffusion-based T2I models. We observe that keys and values in cross-attention layers carry strong semantic meaning associated with object layouts and content. Therefore, we can better preserve the compositional semantics in the generated image by manipulating the cross-attention representations based on linguistic insights. Built upon Stable Diffusion, a state-of-the-art T2I model, our structured cross-attention design is efficient and requires no additional training samples. We achieve better compositional skills in qualitative and quantitative results, leading to a 5-8% advantage in head-to-head user comparison studies. Lastly, we conduct an in-depth analysis to reveal potential causes of incorrect image compositions and to justify the properties of cross-attention layers in the generation process.
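
As a rough illustration of what manipulating cross-attention values could look like, the sketch below computes one attention map from a single set of keys and applies it to several value matrices, then averages the outputs. This is an assumption made for illustration only: the abstract does not spell out the exact operation, and the function names (`attention_map`, `structured_cross_attention`), the averaging step, and the toy inputs are hypothetical rather than taken from the paper or from the Stable Diffusion code base.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


def attention_map(queries, keys):
    """Scaled dot-product attention weights: one row per query (image position)."""
    d = len(keys[0])
    scores = matmul(queries, transpose(keys))
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]


def structured_cross_attention(queries, keys, value_sets):
    """Apply one attention map to several value matrices and average the outputs.

    Assumption: each entry of value_sets is a (tokens x dim) matrix derived from a
    different reading of the prompt (for example, the full sentence and a version
    with one phrase encoded separately). How such matrices are built is not shown.
    """
    attn = attention_map(queries, keys)
    outputs = [matmul(attn, values) for values in value_sets]
    n = len(outputs)
    rows, cols = len(outputs[0]), len(outputs[0][0])
    return [
        [sum(out[i][j] for out in outputs) / n for j in range(cols)]
        for i in range(rows)
    ]


# Toy example: 2 image positions, 3 text tokens, feature dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values_full = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # hypothetical full-prompt values
values_phrase = [[1.0, 2.0], [0.0, 0.0], [5.0, 6.0]]  # hypothetical phrase-level values

result = structured_cross_attention(queries, keys, [values_full, values_phrase])
print(result)
```

Because attention is linear in the values, averaging the outputs in this sketch is equivalent to attending once over the averaged value matrix; the layout (the attention map) stays fixed while the content (the values) changes.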

updated: Tue Feb 28 2023 23:46:24 GMT+0000 (UTC)

published: Fri Dec 09 2022 18:30:24 GMT+0000 (UTC)

arXiv
