Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis

Qiucheng Wu; Yujian Liu; Handong Zhao; Trung Bui; Zhe Lin; Yang Zhang; Shiyu Chang

Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks. However, one critical limitation of these models is the low fidelity of generated images with respect to the text description, which shows up as missing objects, mismatched attributes, and mislocated objects. One key reason for such inconsistencies is inaccurate cross-attention to the text in both the spatial dimension, which controls in which pixel region an object should appear, and the temporal dimension, which controls how different levels of detail are added through the denoising steps. In this paper, we propose a new text-to-image algorithm that adds explicit control over spatial-temporal cross-attention in diffusion models. We first use a layout predictor to predict the pixel regions for objects mentioned in the text. We then impose spatial attention control: within the pixel region of each object, we combine the attention over the entire text description with the attention over the local description of that object. Temporal attention control is added by allowing the combination weights to change at each denoising step, and the combination weights are optimized to ensure high fidelity between the image and the text. Experiments show that, without fine-tuning the diffusion model, our method generates images with higher fidelity than diffusion-model-based baselines. Our code is publicly available at https://github.com/UCSB-NLP-Chang/Diffusion-SpaceTime-Attn.
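The spatial combination described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that); it only assumes that, inside each object's pixel region, the output of cross-attention over the full prompt is linearly blended with the output of cross-attention over that object's local description, using a single scalar weight for the current denoising step. The function names, the tensor shapes, and the linear blending form are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product cross-attention: each pixel query attends to text tokens.
    # queries: (P, d), keys: (T, d), values: (T, dv) -> output: (P, dv)
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values

def combine_attention(queries, global_kv, local_kvs, masks, weight):
    # Hypothetical blend (an assumption, not the paper's exact formula):
    #   inside an object's region: weight * global + (1 - weight) * local
    #   outside every region:      global only
    # global_kv: (keys, values) for the full text description
    # local_kvs: one (keys, values) pair per object's local description
    # masks:     one boolean array of shape (P,) per object (its pixel region)
    # weight:    scalar in [0, 1] for the current denoising step
    global_out = cross_attention(queries, *global_kv)
    out = global_out.copy()
    for (keys, values), mask in zip(local_kvs, masks):
        local_out = cross_attention(queries[mask], keys, values)
        # If regions overlap, the later object overwrites the earlier one.
        out[mask] = weight * global_out[mask] + (1.0 - weight) * local_out
    return out
```

Temporal control would then correspond to calling `combine_attention` with a different `weight` at each denoising step. According to the abstract those weights are optimized for image-text fidelity; the sketch leaves the optimization out and treats the weight as a given input.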

updated: Fri Apr 07 2023 23:49:34 GMT+0000 (UTC)

published: Fri Apr 07 2023 23:49:34 GMT+0000 (UTC)

arXiv
