Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks

Qingrong Cheng; Keyu Wen; Xiaodong Gu

Text-to-image synthesis aims to generate a photo-realistic and semantically consistent image from a given text description. Images synthesized by off-the-shelf models usually contain fewer components than the corresponding real image and text description, which lowers both image quality and textual-visual consistency. To address this issue, we propose a novel vision-language matching strategy for text-to-image synthesis, named VLMGAN*, which introduces a dual vision-language matching mechanism to strengthen image quality and semantic consistency. The mechanism considers textual-visual matching between the generated image and the corresponding text description, and visual-visual consistency constraints between the synthesized image and the real image. Given a text description, VLMGAN* first encodes it into textual features and then feeds them to a generative model based on dual vision-language matching to synthesize a photo-realistic image that is semantically consistent with the text. In addition, the popular evaluation metrics for text-to-image synthesis are borrowed from plain image generation and mainly evaluate the realism and diversity of the synthesized images. We therefore introduce a metric named Vision-Language Matching Score (VLMS), which considers both image quality and the semantic consistency between the synthesized image and the description. The proposed dual multi-level vision-language matching strategy can be applied to other text-to-image synthesis methods. We implement this strategy on two popular baselines, denoted VLMGAN_{+AttnGAN} and VLMGAN_{+DFGAN}. Experimental results on two widely used datasets show that the model achieves significant improvements over other state-of-the-art methods.
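
The abstract does not give the loss formulation, so the following is only an illustrative sketch of how a dual matching objective of this kind might be combined, not the authors' implementation. It assumes that images and text have already been encoded into fixed-length embedding vectors, and it uses a batch-wise contrastive (softmax over cosine similarity) loss as a stand-in for both the textual-visual and the visual-visual terms. The function names, the temperature `gamma`, and the weights `lam_tv` and `lam_vv` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def matching_loss(queries, keys, gamma=10.0):
    """Contrastive matching loss over a batch (assumed form, not from the paper).

    The i-th query is expected to match the i-th key; every other key in
    the batch acts as a negative. gamma is a hypothetical temperature.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [gamma * cosine(q, k) for k in keys]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # negative log-probability of the true pair
    return total / len(queries)


def dual_matching_loss(fake_img, text, real_img, lam_tv=1.0, lam_vv=1.0):
    """Weighted sum of the two matching terms described in the abstract.

    textual-visual: generated-image embeddings vs. text embeddings
    visual-visual:  generated-image embeddings vs. real-image embeddings
    lam_tv and lam_vv are hypothetical weights.
    """
    textual_visual = matching_loss(fake_img, text, gamma=10.0)
    visual_visual = matching_loss(fake_img, real_img, gamma=10.0)
    return lam_tv * textual_visual + lam_vv * visual_visual


# Toy 2-D embeddings: a batch of two text descriptions and two real images.
text_emb = [[1.0, 0.0], [0.0, 1.0]]
real_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_fake = [[1.0, 0.0], [0.0, 1.0]]   # generated images match their pairs
swapped_fake = [[0.0, 1.0], [1.0, 0.0]]   # generated images match the wrong pairs

loss_aligned = dual_matching_loss(aligned_fake, text_emb, real_emb)
loss_swapped = dual_matching_loss(swapped_fake, text_emb, real_emb)
```

In this toy setup the aligned batch yields a much smaller loss than the swapped batch, which is the behavior a matching constraint is meant to reward. In a real generator such a term would be added to the adversarial loss, with learned image and text encoders in place of the hand-written vectors.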

updated: Sat Aug 20 2022 03:34:04 GMT+0000 (UTC)

published: Sat Aug 20 2022 03:34:04 GMT+0000 (UTC)

arXiv
