Progressive Denoising Model for Fine-Grained Text-to-Image Generation

Zhengcong Fei; Mingyuan Fan; Junshi Huang; Xiaoming Wei; Xiaolin Wei

Recently, vector quantized autoregressive (VQ-AR) models have shown remarkable results in text-to-image synthesis by predicting discrete image tokens uniformly, from the top left to the bottom right of the latent space. Although this simple generative process works surprisingly well, is it the best way to generate an image? For instance, humans tend to create an image from its outline to its fine details, while VQ-AR models do not consider the relative importance of each image component. In this paper, we present a progressive denoising model for high-fidelity text-to-image generation. The proposed method creates new image tokens from coarse to fine, in parallel, based on the existing context, and this procedure is applied recursively until the image sequence is complete. The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable. Extensive experiments demonstrate that the progressive model produces significantly better FID scores than the previous VQ-AR method across a wide variety of categories and aspects. Moreover, the text-to-image generation time of traditional AR models increases linearly with the output image resolution and is therefore quite time-consuming even for normal-size images. In contrast, our approach achieves a better trade-off between generation quality and speed.
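
The recursive, parallel coarse-to-fine procedure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the network, the rule for deciding which tokens count as coarse, or the number of stages. The names `toy_predictor` and `progressive_generate`, the confidence-based ranking, and the even per-stage budget are all hypothetical, chosen only to show the control flow of filling several token positions per pass instead of one.

```python
import random

MASK = -1  # placeholder for a token position that has not been generated yet


def toy_predictor(tokens, text, rng, vocab_size=16):
    """Hypothetical stand-in for a text-conditioned token predictor.

    Returns a (token, confidence) proposal for every position still masked.
    A real model would condition on `text` and on the tokens already filled;
    here the proposals are random, so only the control flow is meaningful.
    """
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def progressive_generate(text, num_tokens=16, num_stages=4, seed=0):
    """Fill a token sequence in a few parallel passes instead of one by one."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    stages_run = 0
    for stage in range(num_stages):
        proposals = toy_predictor(tokens, text, rng)
        if not proposals:
            break  # every position is already filled
        # Spread the remaining positions evenly over the remaining stages
        # (ceiling division), so the final stage fills whatever is left.
        remaining_stages = num_stages - stage
        budget = -(-len(proposals) // remaining_stages)
        # Assumption: confidence is used as a proxy for "coarse" tokens.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:budget]:
            tokens[i] = proposals[i][0]
        stages_run += 1
    return tokens, stages_run


if __name__ == "__main__":
    tokens, stages = progressive_generate("a red bird on a branch")
    print(tokens, stages)
```

In this toy setting, the 16 positions are filled in 4 parallel passes, whereas a raster-order autoregressive decoder would need 16 sequential steps for the same sequence. This is the kind of quality and speed trade-off the abstract refers to.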

updated: Fri Nov 04 2022 13:54:52 GMT+0000 (UTC)

published: Wed Oct 05 2022 14:27:20 GMT+0000 (UTC)

arXiv

