Towards Open-World Text-Guided Face Image Generation and Manipulation

Weihao Xia; Yujiu Yang; Jing-Hao Xue; Baoyuan Wu

Existing text-guided image synthesis methods produce results of limited quality at resolutions of at most 256^2, and their textual instructions are constrained to a small corpus. In this work, we propose a unified framework for both face image generation and manipulation that produces diverse, high-quality images at an unprecedented resolution of 1024^2 from multimodal inputs. More importantly, our method supports open-world scenarios, covering both images and text, without any re-training, fine-tuning, or post-processing. Specifically, we propose a new paradigm for text-guided image generation and manipulation that builds on the favorable properties of a pretrained GAN model. The paradigm includes two novel strategies. The first is to train a text encoder to produce latent codes that align with the hierarchical semantics of the pretrained GAN model. The second is to directly optimize latent codes in the latent space of the pretrained GAN model, with guidance from a pretrained language model. The latent codes can be randomly sampled from a prior distribution or inverted from a given image, which provides inherent support for both image generation and manipulation from multimodal inputs, such as sketches or semantic labels, with textual guidance. To facilitate text-guided multimodal synthesis, we propose Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images with corresponding semantic segmentation maps, sketches, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at https://github.com/weihaox/TediGAN.
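The second strategy, direct optimization of a latent code under external guidance, can be illustrated with a toy sketch. This is not the TediGAN implementation: the fixed random linear map `A` is a hypothetical stand-in for the pretrained GAN generator followed by a feature extractor, the vector `target` is a stand-in for a text embedding from a pretrained language model, and the regularizer that keeps the code near its starting point `w0` is an assumption added for illustration. All names here are invented for the example.

```python
import random

def matvec(A, w):
    """Apply the toy 'generator + feature extractor' (a linear map) to a latent code."""
    return [sum(a * x for a, x in zip(row, w)) for row in A]

def loss_and_grad(A, w, w0, target, reg):
    """Toy objective: ||A w - target||^2 + reg * ||w - w0||^2, with its gradient in w."""
    residual = [g - t for g, t in zip(matvec(A, w), target)]
    loss = sum(r * r for r in residual)
    loss += reg * sum((x - y) ** 2 for x, y in zip(w, w0))
    # Gradient: 2 * A^T residual + 2 * reg * (w - w0)
    grad = [
        2.0 * sum(A[i][j] * residual[i] for i in range(len(A)))
        + 2.0 * reg * (w[j] - w0[j])
        for j in range(len(w))
    ]
    return loss, grad

def optimize_latent(A, w0, target, reg=0.1, lr=0.05, steps=200):
    """Plain gradient descent on the latent code; the 'generator' A stays frozen."""
    w = list(w0)
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(A, w, w0, target, reg)
        history.append(loss)
        w = [x - lr * g for x, g in zip(w, grad)]
    return w, history

rng = random.Random(0)
latent_dim, feat_dim = 4, 3
# Frozen stand-in for the pretrained model (assumption: a random linear map).
A = [[rng.uniform(-1, 1) for _ in range(latent_dim)] for _ in range(feat_dim)]
# Starting code: sampled from a prior here; an inverted image code would play the same role.
w0 = [rng.gauss(0, 1) for _ in range(latent_dim)]
# Stand-in for the text embedding that guides the optimization.
target = [rng.uniform(-1, 1) for _ in range(feat_dim)]

w_opt, history = optimize_latent(A, w0, target)
```

Only the latent code is updated while the stand-in generator is left untouched, which mirrors the claim that no re-training or fine-tuning of the pretrained model is needed. Starting from a sampled code corresponds to generation, and starting from a code inverted from an image corresponds to manipulation.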

updated: Sun Apr 18 2021 16:56:07 GMT+0000 (UTC)

published: Sun Apr 18 2021 16:56:07 GMT+0000 (UTC)

arXiv
