What the DAAM: Interpreting Stable Diffusion Using Cross Attention

Raphael Tang; Akshat Pandey; Zhiying Jiang; Gefei Yang; Karun Kumar; Jimmy Lin; Ferhan Ture

Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, with some performing similarly to real photographs in human evaluation. However, they remain poorly understood, lacking explainability and interpretability analyses, largely due to their proprietary, closed-source nature. In this paper, to shine some much-needed light on text-to-image diffusion models, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced large diffusion model. To produce pixel-level attribution maps, we propose DAAM, a novel method based on upscaling and aggregating cross-attention activations in the latent denoising subnetwork. We support its correctness by evaluating its unsupervised semantic segmentation quality on its own generated imagery, compared to supervised segmentation models. We show that DAAM performs strongly on COCO caption-generated images, achieving an mIoU of 61.0, and that it outperforms supervised models on open-vocabulary segmentation, with an mIoU of 51.5. We further find that certain parts of speech, like punctuation and conjunctions, influence the generated imagery the most, which agrees with the prior literature, while determiners and numerals influence it the least, suggesting poor numeracy. To our knowledge, we are the first to propose and study word-pixel attribution for interpreting large-scale diffusion models. Our code and data are at https://github.com/castorini/daam.
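The abstract's two core ideas, upscaling and aggregating cross-attention maps into a per-word heat map and then scoring the resulting mask by intersection over union, can be illustrated with a minimal NumPy sketch. This is not the code from the repository above: the function names are hypothetical, nearest-neighbor repetition stands in for whatever interpolation the authors use, and the relative threshold `tau` is an assumed way of turning a heat map into a mask.

```python
import numpy as np

def upscale_nearest(attn, size):
    """Upscale a square (h, h) map to (size, size) by repeating each cell.

    Assumption: nearest-neighbor repetition is used here only for simplicity.
    """
    factor = size // attn.shape[0]
    return np.kron(attn, np.ones((factor, factor)))

def aggregate_heat_map(attn_maps, size):
    """Sum the upscaled maps for one word (e.g. over layers, heads, time steps)."""
    total = np.zeros((size, size))
    for attn in attn_maps:
        total += upscale_nearest(attn, size)
    return total

def binarize(heat_map, tau=0.5):
    """Assumed masking rule: keep pixels at or above tau times the maximum."""
    return heat_map >= tau * heat_map.max()

def iou(pred, target):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union else 1.0

# Synthetic attention maps for one word at two resolutions (illustrative data).
low_res = np.array([[1.0, 0.0],
                    [0.0, 0.0]])
mid_res = np.zeros((4, 4))
mid_res[:2, :2] = 1.0

heat = aggregate_heat_map([low_res, mid_res], size=8)
mask = binarize(heat, tau=0.5)

# Hypothetical reference mask covering the top half of the image.
reference = np.zeros((8, 8), dtype=bool)
reference[:4, :] = True
score = iou(mask, reference)
```

In this toy case both maps attend to the top-left quadrant, so the aggregated mask covers 16 pixels, the reference covers 32, and the IoU is 0.5. Averaging such scores over classes gives an mIoU of the kind the abstract reports.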

updated: Thu Oct 13 2022 02:00:54 GMT+0000 (UTC)

published: Mon Oct 10 2022 17:55:41 GMT+0000 (UTC)

arXiv
