Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces

Dominic Rampas; Pablo Pernias; Elea Zhong; Marc Aubreville

Conditional text-to-image generation has seen countless recent improvements in quality, diversity, and fidelity. Nevertheless, most state-of-the-art models require numerous inference steps to produce faithful generations, resulting in performance bottlenecks for end-user applications. In this paper, we introduce Paella, a novel text-to-image model that requires fewer than 10 steps to sample high-fidelity images. Its speed-optimized architecture samples a single image in less than 500 ms while having 573M parameters. The model operates on a compressed and quantized latent space, is conditioned on CLIP embeddings, and uses an improved sampling function over previous works. Aside from text-conditional image generation, our model is able to do latent space interpolation and image manipulations such as inpainting, outpainting, and structural editing. We release all of our code and pretrained models at https://github.com/dome272/Paella.

updated: Mon Nov 14 2022 11:52:55 GMT+0000 (UTC)

published: Mon Nov 14 2022 11:52:55 GMT+0000 (UTC)

arXiv
