Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment

Royi Rassin; Eran Hirsch; Daniel Glickman; Shauli Ravfogel; Yoav Goldberg; Gal Chechik

Text-conditioned image generation models often generate incorrect associations between entities and their visual attributes. This reflects an impaired mapping between the linguistic binding of entities and modifiers in the prompt and the visual binding of the corresponding elements in the generated image. As one notable example, a query like "a pink sunflower and a yellow flamingo" may incorrectly produce an image of a yellow sunflower and a pink flamingo. To remedy this issue, we propose SynGen, an approach that first syntactically analyses the prompt to identify entities and their modifiers, and then uses a novel loss function that encourages the cross-attention maps to agree with the linguistic binding reflected by the syntax. Specifically, we encourage large overlap between the attention maps of entities and their modifiers, and small overlap with other entities and modifier words. The loss is optimized during inference, without retraining or fine-tuning the model. Human evaluation on three datasets, including one new and challenging set, demonstrates significant improvements of SynGen compared with current state-of-the-art methods. This work highlights how making use of sentence structure during inference can efficiently and substantially improve the faithfulness of text-to-image generation.
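The overlap objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the overlap measure, so the symmetric KL divergence used here is an assumption, as are the function names (`normalize`, `sym_kl`, `binding_loss`), the toy 4-pixel attention maps, and the hand-written modifier-entity pairs that stand in for the output of a syntactic parser. In an actual pipeline such a loss would presumably be differentiated with respect to the latent during denoising; the sketch only evaluates it.

```python
import math

def normalize(attn_map, eps=1e-8):
    """Turn a flattened, non-negative attention map into a probability distribution."""
    total = sum(attn_map) + eps * len(attn_map)
    return [(a + eps) / total for a in attn_map]

def sym_kl(p, q):
    """Symmetric KL divergence (an assumed choice of distance between maps)."""
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)

def binding_loss(maps, pairs):
    """Lower is better: bound words overlap, unrelated words do not.

    maps:  dict mapping a token to its flattened cross-attention map
    pairs: list of (modifier, entity) tuples, e.g. from a dependency parse
    """
    dists = {tok: normalize(m) for tok, m in maps.items()}
    loss = 0.0
    for modifier, entity in pairs:
        # Positive term: a modifier and its entity should attend to the same region.
        loss += sym_kl(dists[modifier], dists[entity])
        # Negative term: both words should be far from every other token's map.
        others = [t for t in dists if t not in (modifier, entity)]
        if others:
            neg = sum(sym_kl(dists[w], dists[o])
                      for w in (modifier, entity) for o in others)
            loss -= neg / (2 * len(others))
    return loss

# Toy maps (hypothetical values): "pink"/"sunflower" attend to the left pixels,
# "yellow"/"flamingo" to the right pixels.
toy_maps = {
    "pink":      [0.9, 0.1, 0.0, 0.0],
    "sunflower": [0.8, 0.2, 0.0, 0.0],
    "yellow":    [0.0, 0.0, 0.1, 0.9],
    "flamingo":  [0.0, 0.0, 0.2, 0.8],
}
correct_pairs = [("pink", "sunflower"), ("yellow", "flamingo")]
swapped_pairs = [("pink", "flamingo"), ("yellow", "sunflower")]

loss_correct = binding_loss(toy_maps, correct_pairs)
loss_swapped = binding_loss(toy_maps, swapped_pairs)
print(loss_correct < loss_swapped)
```

With these toy maps, the pairing that matches the attention layout scores a lower loss than the swapped pairing, which is the behaviour the abstract asks of the objective: attention that agrees with the syntactic binding is rewarded, and attention that leaks across entities is penalized.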

updated: Thu Jun 15 2023 06:21:44 GMT+0000 (UTC)

published: Thu Jun 15 2023 06:21:44 GMT+0000 (UTC)

arXiv
