Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation

Yuki Endo

Text-to-image synthesis has achieved high-quality results with recent advances in diffusion models. However, text input alone leaves high spatial ambiguity and offers limited user controllability. Most existing methods allow spatial control through additional visual guidance (e.g., sketches and semantic masks) but require additional training with annotated images. In this paper, we propose a method for spatially controlling text-to-image generation without further training of diffusion models. Our method is based on the insight that cross-attention maps reflect the positional relationship between words and pixels. Our aim is to control the attention maps according to given semantic masks and text prompts. To this end, we first explore a simple approach of directly swapping the cross-attention maps with constant maps computed from the semantic regions. We then propose masked-attention guidance, which can generate images more faithful to semantic masks than the first approach. Masked-attention guidance indirectly controls attention to each word and pixel according to the semantic regions by manipulating the noise images fed to diffusion models. Experiments show that our method enables more accurate spatial control than baselines, both qualitatively and quantitatively.
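
The first approach mentioned in the abstract, swapping cross-attention maps with constant maps computed from the semantic regions, can be illustrated with a minimal NumPy sketch. This is not the author's implementation: the function name `constant_attention_maps`, the array layout (one binary mask per prompt token, output of shape pixels × tokens), and the uniform fallback for pixels not covered by any mask are all assumptions made for illustration only.

```python
import numpy as np


def constant_attention_maps(masks):
    """Build constant attention maps from per-token semantic masks.

    Hypothetical helper, not from the paper.
    masks: array of shape (num_tokens, H, W) with 0/1 entries, where
           masks[k] marks the pixels in which token k should appear.
    Returns an array of shape (H * W, num_tokens); each row sums to 1.
    """
    masks = np.asarray(masks, dtype=np.float64)
    num_tokens, h, w = masks.shape

    # Flatten spatial dimensions: rows are pixels, columns are tokens.
    flat = masks.reshape(num_tokens, h * w).T

    # Number of tokens whose region covers each pixel.
    row_sums = flat.sum(axis=1, keepdims=True)

    # Assumption: pixels covered by no region attend uniformly to all tokens.
    uniform = np.full_like(flat, 1.0 / num_tokens)

    # Covered pixels split attention equally among the tokens covering them.
    return np.where(row_sums > 0, flat / np.maximum(row_sums, 1.0), uniform)


# Usage: a 2x2 image where token 0 owns the left column and token 1 the right.
example_masks = np.array([
    [[1, 0],
     [1, 0]],
    [[0, 1],
     [0, 1]],
])
example_maps = constant_attention_maps(example_masks)
```

In a real diffusion model, a map like `example_maps` would stand in for the softmax output of a cross-attention layer at the matching resolution. How and at which layers or timesteps the swap is applied is not specified in the abstract and is left out of this sketch.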

updated: Fri Aug 11 2023 09:15:22 GMT+0000 (UTC)

published: Fri Aug 11 2023 09:15:22 GMT+0000 (UTC)

arXiv
