Vector-quantized Image Modeling with Improved VQGAN

Jiahui Yu; Xin Li; Jing Yu Koh; Han Zhang; Ruoming Pang; James Qin; Alexander Ku; Yuanzhong Xu; Jason Baldridge; Yonghui Wu

Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional and class-conditioned image generation and unsupervised representation learning. When trained on ImageNet at 256x256 resolution, we achieve an Inception Score (IS) of 175.1 and a Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively. Based on ViT-VQGAN and unsupervised pretraining, we further evaluate the pretrained Transformer by averaging intermediate features, similar to Image GPT (iGPT). This ImageNet-pretrained VIM-L significantly beats iGPT-L at a similar model size, raising linear-probe accuracy from 60.3% to 72.2%. VIM-L also outperforms iGPT-XL, which is trained with extra web image data and a larger model size.
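As a rough illustration of the pipeline the abstract describes (encode an image into discrete tokens, flatten them in raster order, then predict each token from the ones before it), here is a minimal sketch in plain Python. The function names `quantize`, `rasterize`, and `next_token_pairs`, the toy codebook, and the use of plain Euclidean nearest-neighbor lookup are all assumptions made for illustration; the abstract does not specify the paper's actual codebook lookup or Transformer training code.

```python
def quantize(latents, codebook):
    """Map every latent vector in a 2D grid to the index of its nearest
    codebook entry (squared Euclidean distance; an illustrative choice)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        [min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))
         for vec in row]
        for row in latents
    ]


def rasterize(token_grid):
    """Flatten a 2D token grid row by row (raster order) into a sequence."""
    return [tok for row in token_grid for tok in row]


def next_token_pairs(tokens):
    """Build (context, target) pairs for autoregressive next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Toy data (hypothetical): a 3-entry codebook and a 2x2 grid of 2-d latents.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
latents = [
    [[0.1, -0.1], [0.9, 1.2]],
    [[0.2, 0.8], [0.0, 0.1]],
]

tokens = rasterize(quantize(latents, codebook))
for context, target in next_token_pairs(tokens):
    print(context, "->", target)
```

In this sketch the encoder output is given directly as `latents`; in the approach described above those vectors would come from the ViT-VQGAN encoder, and the (context, target) pairs would be consumed by a Transformer rather than printed.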

updated: Sat Oct 09 2021 18:36:00 GMT+0000 (UTC)

published: Sat Oct 09 2021 18:36:00 GMT+0000 (UTC)

arXiv
