Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes

Sam Bond-Taylor; Peter Hessey; Hiroshi Sasaki; Toby P. Breckon; Chris G. Willcocks

Whilst diffusion probabilistic models can generate high-quality image content, key limitations remain in terms of both generating high-resolution imagery and their associated high computational requirements. Recent Vector-Quantized image models have overcome this limitation of image resolution but are prohibitively slow and unidirectional as they generate tokens via element-wise autoregressive sampling from the prior. By contrast, in this paper we propose a novel discrete diffusion probabilistic model prior which enables parallel prediction of Vector-Quantized tokens by using an unconstrained Transformer architecture as the backbone. During training, tokens are randomly masked in an order-agnostic manner and the Transformer learns to predict the original tokens. This parallelism of Vector-Quantized token prediction in turn facilitates unconditional generation of globally consistent high-resolution and diverse imagery at a fraction of the computational expense. In this manner, we can generate image resolutions exceeding that of the original training set samples whilst additionally provisioning per-image likelihood estimates (in a departure from generative adversarial approaches). Our approach achieves state-of-the-art results in terms of Density (LSUN Bedroom: 1.51; LSUN Churches: 1.12; FFHQ: 1.20) and Coverage (LSUN Bedroom: 0.83; LSUN Churches: 0.73; FFHQ: 0.80), and performs competitively on FID (LSUN Bedroom: 3.64; LSUN Churches: 4.07; FFHQ: 6.11) whilst offering advantages in terms of both computation and reduced training set requirements.
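The masking and parallel prediction described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a linear schedule in which each token is masked independently with probability t / num_steps, and a reverse process that reveals each still-masked position with probability 1 / t. The names `mask_tokens`, `sample_parallel`, `predict_fn`, and `mask_id` are hypothetical, and `predict_fn` stands in for the trained Transformer.

```python
import numpy as np

def mask_tokens(tokens, t, num_steps, mask_id, rng):
    """Forward (absorbing) corruption: mask each token independently
    with probability t / num_steps (assumed linear schedule)."""
    tokens = np.asarray(tokens)
    masked = rng.random(tokens.shape) < (t / num_steps)
    corrupted = np.where(masked, mask_id, tokens)
    # A denoiser would be trained to predict `tokens` at the masked positions.
    return corrupted, masked

def sample_parallel(predict_fn, length, num_steps, mask_id, rng):
    """Reverse process: start fully masked and reveal a random subset of
    positions per step, using one prediction call for the whole sequence."""
    x = np.full(length, mask_id, dtype=np.int64)
    for t in range(num_steps, 0, -1):
        still_masked = x == mask_id
        if not still_masked.any():
            break
        # Reveal with probability 1 / t, so everything is revealed by t = 1.
        reveal = still_masked & (rng.random(length) < 1.0 / t)
        if reveal.any():
            probs = predict_fn(x)  # hypothetical model, shape (length, vocab)
            for i in np.flatnonzero(reveal):
                x[i] = rng.choice(probs.shape[1], p=probs[i])
    return x

def uniform_predictor(x, vocab=8):
    """Placeholder for a trained Transformer: uniform over the codebook."""
    return np.full((len(x), vocab), 1.0 / vocab)
```

With a trained model in place of `uniform_predictor`, the number of model calls is bounded by `num_steps` rather than by the sequence length, which is the source of the speed-up over element-wise autoregressive sampling that the abstract refers to.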

updated: Wed Nov 24 2021 18:55:14 GMT+0000 (UTC)

published: Wed Nov 24 2021 18:55:14 GMT+0000 (UTC)

arXiv
