Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu; Yuanzhong Xu; Jing Yu Koh; Thang Luong; Gunjan Baid; Zirui Wang; Vijay Vasudevan; Alexander Ku; Yinfei Yang; Burcu Karagol Ayan; Ben Hutchinson; Wei Han; Zarana Parekh; Xin Li; Han Zhang; Jason Baldridge; Yonghui Wu

We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.
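The two-stage recipe in the abstract (tokenize the image, then predict image tokens from text) can be illustrated with a toy sketch. This is not the Parti or ViT-VQGAN implementation: the patch vectors, the four-entry codebook, the text token ids, the start-of-sequence id, and the function names `quantize` and `build_training_pair` are all hypothetical, and the nearest-codebook lookup stands in for a learned tokenizer only in the general sense of vector quantization.

```python
from typing import Dict, List, Sequence


def quantize(vectors: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each vector to the index of its nearest codebook entry
    (squared Euclidean distance). A real tokenizer would produce the
    vectors with a learned encoder; here they are given directly."""
    tokens = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def build_training_pair(text_ids: Sequence[int],
                        image_tokens: Sequence[int],
                        bos_id: int) -> Dict[str, List[int]]:
    """Form a teacher-forcing example for an encoder-decoder model:
    the encoder reads the text ids, and the decoder input is the
    image-token target shifted right by one position."""
    image_tokens = list(image_tokens)
    return {
        "encoder_input": list(text_ids),
        "decoder_input": [bos_id] + image_tokens[:-1],
        "target": image_tokens,
    }


# Hypothetical 2-D codebook with four entries (ids 0..3).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical patch embeddings for a 2x2 grid of image patches.
patches = [[0.1, 0.2], [0.9, 0.1], [0.2, 0.8], [0.95, 0.9]]

image_tokens = quantize(patches, codebook)
# Hypothetical text ids; bos_id=4 is chosen to sit outside the codebook range.
pair = build_training_pair([5, 17, 42], image_tokens, bos_id=4)
print(image_tokens)
print(pair["decoder_input"])
```

In this framing the image tokens play the role that target-language tokens play in machine translation, which is why a standard encoder-decoder training setup applies unchanged.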

updated: Wed Jun 22 2022 01:11:29 GMT+0000 (UTC)

published: Wed Jun 22 2022 01:11:29 GMT+0000 (UTC)

arXiv
