Image Captions are Natural Prompts for Text-to-Image Models

Shiye Lei; Hao Chen; Sen Zhang; Bo Zhao; Dacheng Tao

With the rapid development of Artificial Intelligence Generated Content (AIGC), it has become common practice in many learning tasks to train or fine-tune large models on synthetic data, owing to data scarcity and privacy leakage concerns. Although unlimited data generation is promising, real images convey massive and diverse information, which makes it challenging for text-to-image generative models to synthesize informative training data from hand-crafted prompts; this usually leads to inferior generalization performance when training downstream models. In this paper, we theoretically analyze the relationship between the training effect of synthetic data and the synthetic data distribution induced by prompts. We then propose a simple yet effective method that prompts text-to-image generative models to synthesize more informative and diverse training data. Specifically, we caption each real image with an advanced captioning model to obtain informative and faithful prompts that extract class-relevant information and clarify the polysemy of class names. The image captions and class names are concatenated to prompt generative models to synthesize training images. Extensive experiments on ImageNette, ImageNet-100, and ImageNet-1K verify that our method significantly improves the performance of models trained on synthetic training data, improving classification accuracy by 10% on average.
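The prompt construction described in the abstract (caption each real image, then concatenate the caption with the class name) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the exact prompt template, so the `", "` separator, the function names `build_prompt` and `prompts_for_dataset`, and the `dummy_captioner` stand-in for a real captioning model are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def build_prompt(class_name: str, caption: str) -> str:
    """Concatenate a class name and an image caption into one prompt.

    Assumption: the ", " separator and the class-name-first order are
    choices made for this sketch, not taken from the paper.
    """
    return f"{class_name.strip()}, {caption.strip()}"


def prompts_for_dataset(
    samples: Iterable[Tuple[str, str]],
    captioner: Callable[[str], str],
) -> List[str]:
    """Build one prompt per real image.

    samples: (image_path, class_name) pairs from the real dataset.
    captioner: hypothetical callable mapping an image path to a caption;
        in practice this would wrap an image captioning model.
    """
    return [
        build_prompt(class_name, captioner(image_path))
        for image_path, class_name in samples
    ]


def dummy_captioner(image_path: str) -> str:
    # Placeholder only: a real captioning model would inspect the image.
    return "a large bird standing in shallow water"


# "crane" is polysemous (bird or machine); the caption disambiguates it.
samples = [("img_001.jpg", "crane")]
prompts = prompts_for_dataset(samples, dummy_captioner)
print(prompts[0])
```

Each resulting prompt would then be passed to a text-to-image model to synthesize one training image; that generation step is left out here because the abstract does not specify it.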

updated: Mon Jul 17 2023 14:38:11 GMT+0000 (UTC)

published: Mon Jul 17 2023 14:38:11 GMT+0000 (UTC)

arXiv
