Grounded Text-to-Image Synthesis with Attention Refocusing

Quynh Phung; Songwei Ge; Jia-Bin Huang

Driven by scalable diffusion models trained on large-scale paired text-image datasets, text-to-image synthesis methods have shown compelling results. However, these models still fail to follow the text prompt precisely when it involves multiple objects, attributes, and spatial compositions. In this paper, we identify potential causes of this failure in both the cross-attention and self-attention layers of the diffusion model. We propose two novel losses that refocus the attention maps according to a given layout during the sampling process. We perform comprehensive experiments on the DrawBench and HRS benchmarks using layouts synthesized by Large Language Models, showing that our proposed losses can be integrated easily and effectively into existing text-to-image methods and consistently improve the alignment between the generated images and the text prompts.

updated: Thu Jun 08 2023 17:59:59 GMT+0000 (UTC)

published: Thu Jun 08 2023 17:59:59 GMT+0000 (UTC)

arXiv
