FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization

Xingchao Liu; Chengyue Gong; Lemeng Wu; Shujian Zhang; Hao Su; Qiang Liu

Generating images from natural language instructions is an intriguing yet highly challenging task. We approach text-to-image generation by combining the power of the pretrained CLIP representation with an off-the-shelf image generator (GAN), optimizing in the latent space of the GAN to find images that achieve the maximum CLIP score with the given input text. Compared to traditional methods that train text-to-image generative models from scratch, the CLIP+GAN approach is training-free, zero-shot, and can be easily customized with different generators. However, optimizing the CLIP score in the GAN space poses a highly challenging optimization problem, and off-the-shelf optimizers such as Adam fail to yield satisfying results. In this work, we propose the FuseDream pipeline, which improves the CLIP+GAN approach with three key techniques: 1) an AugCLIP score, which robustifies the CLIP objective by introducing random augmentations on the image; 2) a novel initialization and over-parameterization strategy for optimization, which allows us to efficiently navigate the non-convex landscape in the GAN space; and 3) a composed generation technique which, by leveraging a novel bi-level optimization formulation, can compose multiple images to extend the GAN space and overcome data bias. When prompted with different input text, FuseDream can generate high-quality images with varying objects, backgrounds, and artistic styles, and even novel counterfactual concepts that do not appear in the training data of the GAN we use. Quantitatively, the images generated by FuseDream achieve top-level Inception score and FID score on the MS COCO dataset, without additional architecture design or training. Our code is publicly available at https://github.com/gnobitab/FuseDream.
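
The core idea in the abstract, searching a generator's latent space for the image whose augmentation-averaged similarity to the text is highest, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the FuseDream code: `toy_generator` and `toy_image_encoder` are hypothetical stand-ins for a GAN and the CLIP image encoder, Gaussian pixel noise stands in for the image augmentations, best-of-several sampling stands in for the initialization strategy, and random hill climbing is swapped in for gradient-based optimization so that the example needs only the Python standard library.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def toy_generator(z):
    """Hypothetical stand-in for a GAN: maps a latent vector to 'pixels'."""
    return [math.tanh(x) for x in z]


def toy_image_encoder(image):
    """Hypothetical stand-in for the CLIP image encoder (identity map)."""
    return list(image)


def augment(image, rng, noise):
    """Random augmentation, here simply additive Gaussian pixel noise."""
    return [p + rng.gauss(0.0, noise) for p in image]


def aug_score(z, text_emb, rng, n_aug=8, noise=0.1):
    """Average text-image similarity over n_aug random augmentations."""
    image = toy_generator(z)
    total = 0.0
    for _ in range(n_aug):
        emb = toy_image_encoder(augment(image, rng, noise))
        total += cosine(emb, text_emb)
    return total / n_aug


def optimize_latent(text_emb, dim, steps=200, n_init=16, seed=0):
    """Search the latent space for a high augmentation-averaged score."""
    rng = random.Random(seed)
    # Illustrative initialization: score several random latents, keep the best.
    scored = []
    for _ in range(n_init):
        candidate = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        scored.append((aug_score(candidate, text_emb, rng), candidate))
    best, z = max(scored, key=lambda pair: pair[0])
    # Random hill climbing (a swapped-in, gradient-free optimizer):
    # keep a perturbed latent only if its noisy score improves.
    for _ in range(steps):
        proposal = [x + rng.gauss(0.0, 0.1) for x in z]
        score = aug_score(proposal, text_emb, rng)
        if score > best:
            z, best = proposal, score
    return z, best


if __name__ == "__main__":
    text_embedding = [1.0, -1.0, 1.0, -1.0]  # hypothetical text embedding
    latent, score = optimize_latent(text_embedding, dim=4)
    print("latent:", latent)
    print("augmentation-averaged score:", score)
```

Averaging over augmentations means a latent is rewarded only if its image matches the text under many small perturbations, which is the intuition behind making the objective more robust. In a real setting the stand-ins would be replaced by an actual generator and CLIP encoders, and the search by a gradient-based optimizer.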

updated: Thu Dec 02 2021 19:27:27 GMT+0000 (UTC)

published: Thu Dec 02 2021 19:27:27 GMT+0000 (UTC)

arXiv
