SSD: Towards Better Text-Image Consistency Metric in Text-to-Image Generation

Zhaorui Tan; Xi Yang; Zihan Ye; Qiufeng Wang; Yuyao Yan; Anh Nguyen; Kaizhu Huang

Generating consistent, high-quality images from given texts is essential for visual-language understanding. Although impressive results have been achieved in generating high-quality images, text-image consistency remains a major concern in existing GAN-based methods. In particular, the most popular metric, R-precision, may not accurately reflect text-image consistency, often resulting in very misleading semantics in the generated images. Despite its significance, how to design a better text-image consistency metric remains surprisingly under-explored in the community. In this paper, we take a further step forward and develop a novel CLIP-based metric termed Semantic Similarity Distance (SSD), which is both theoretically grounded from a distributional viewpoint and empirically verified on benchmark datasets. Benefiting from the proposed metric, we further design the Parallel Deep Fusion Generative Adversarial Networks (PDF-GAN), which aims to improve text-image consistency by fusing semantic information at different granularities and capturing accurate semantics. Equipped with two novel plug-and-play components, Hard-Negative Sentence Constructor and Semantic Projection, the proposed PDF-GAN can mitigate inconsistent semantics and bridge the text-image semantic gap. A series of experiments shows that, compared with current state-of-the-art methods, PDF-GAN achieves significantly better text-image consistency while maintaining decent image quality on the CUB and COCO datasets.

updated: Sat Dec 03 2022 05:15:10 GMT+0000 (UTC)

published: Thu Oct 27 2022 07:47:47 GMT+0000 (UTC)

arXiv
