Discriminative Class Tokens for Text-to-Image Diffusion Models

Idan Schwartz; Vésteinn Snæbjarnarson; Hila Chefer; Ryan Cotterell; Serge Belongie; Lior Wolf; Sagie Benaim


Recent advances in text-to-image diffusion models have enabled the generation of diverse and high-quality images. While impressive, the images often fall short of depicting subtle details and are susceptible to errors due to ambiguity in the input text. One way of alleviating these issues is to train diffusion models on class-labeled datasets. This approach has two disadvantages: (i) supervised datasets are generally small compared to the large-scale scraped text-image datasets on which text-to-image models are trained, which affects the quality and diversity of the generated images, and (ii) the input is a hard-coded label, as opposed to free-form text, which limits control over the generated images. In this work, we propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text while achieving high accuracy through discriminative signals from a pretrained classifier. The technique iteratively modifies the embedding of an added input token of a text-to-image diffusion model, steering generated images toward a given target class according to the classifier. Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images or retraining of a noise-tolerant classifier. We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than those of standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier. The code is available at https://github.com/idansc/discriminative_class_tokens.
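The core idea, updating only the embedding of an added token while the generator and the classifier stay frozen, can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: the two small linear maps below are hypothetical stand-ins for the diffusion model and the pretrained classifier, the names `FROZEN_GENERATOR`, `FROZEN_CLASSIFIER`, and `optimize_class_token` are invented for illustration, and the gradient is written out by hand so that the example runs with the Python standard library alone.

```python
import math
import random

# Hypothetical frozen "generator": maps a 3-d token embedding to 4 image features.
FROZEN_GENERATOR = [[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [1.0, 1.0, 1.0]]

# Hypothetical frozen "classifier": maps 4 image features to 3 class logits.
FROZEN_CLASSIFIER = [[1.0, 0.0, 0.0, 0.1],
                     [0.0, 1.0, 0.0, 0.1],
                     [0.0, 0.0, 1.0, 0.1]]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_probs(token):
    """Class probabilities of the 'image' generated from a token embedding."""
    image = matvec(FROZEN_GENERATOR, token)
    return softmax(matvec(FROZEN_CLASSIFIER, image))


def optimize_class_token(target, steps=200, lr=0.5, seed=0):
    """Gradient descent on the token embedding only; both maps stay fixed."""
    rng = random.Random(seed)
    dim = len(FROZEN_GENERATOR[0])
    token = [rng.gauss(0.0, 0.1) for _ in range(dim)]
    for _ in range(steps):
        probs = class_probs(token)
        # Gradient of the cross-entropy loss w.r.t. the logits: probs - one_hot.
        d_logits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
        # Backpropagate through the classifier (transpose of its weight matrix).
        d_image = [sum(FROZEN_CLASSIFIER[k][i] * d_logits[k]
                       for k in range(len(FROZEN_CLASSIFIER)))
                   for i in range(len(FROZEN_GENERATOR))]
        # Backpropagate through the generator down to the token embedding.
        d_token = [sum(FROZEN_GENERATOR[i][j] * d_image[i]
                       for i in range(len(FROZEN_GENERATOR)))
                   for j in range(dim)]
        token = [t - lr * g for t, g in zip(token, d_token)]
    return token


if __name__ == "__main__":
    learned = optimize_class_token(target=2)
    print("class probabilities after optimization:", class_probs(learned))
```

In a real setting the generator would be a text-to-image diffusion pipeline and the gradients would come from automatic differentiation; the sketch only shows which quantity is trained (the token embedding) and where the training signal originates (the classifier's loss on the generated output).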

updated: Sun Sep 10 2023 17:33:30 GMT+0000 (UTC)

published: Thu Mar 30 2023 05:25:20 GMT+0000 (UTC)

arXiv
