LAFITE: Towards Language-Free Training for Text-to-Image Generation

Yufan Zhou; Ruiyi Zhang; Changyou Chen; Chunyuan Li; Chris Tensmeyer; Tong Yu; Jiuxiang Gu; Jinhui Xu; Tong Sun

A major challenge in training text-to-image generation models is the need for a large number of high-quality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is time-consuming and costly. In this paper, we propose the first method to train text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement of text conditioning is alleviated by generating text features from image features. Extensive experiments illustrate the effectiveness of the proposed method. We obtain state-of-the-art results on standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full image-text pairs. Furthermore, our method can be applied to fine-tuning pre-trained models, which saves both training time and cost. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, with only around 1% of the model size and training data size of the recently proposed large DALL-E model.
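The idea of generating text features from image features can be illustrated with a minimal sketch. The abstract does not spell out the mechanism, so the code below assumes one plausible form: an image embedding is perturbed with Gaussian noise scaled relative to its norm, then renormalized, so that the result stays close to the image in a shared embedding space and can stand in for a text embedding. The function name `pseudo_text_feature`, the parameter `noise_level`, and the random 512-dimensional vector used in place of a real CLIP image embedding are all hypothetical and chosen for illustration only.

```python
import numpy as np

def pseudo_text_feature(image_feature, noise_level=0.1, rng=None):
    """Return a unit-norm 'text-like' feature near the given image feature.

    Assumption (not stated in the abstract): the text feature is approximated
    by adding Gaussian noise whose norm is noise_level * ||image_feature||.
    """
    rng = np.random.default_rng() if rng is None else rng
    h = np.asarray(image_feature, dtype=np.float64)
    eps = rng.standard_normal(h.shape)
    # Rescale the noise so its length is a fixed fraction of the feature length.
    perturbed = h + noise_level * np.linalg.norm(h) * eps / np.linalg.norm(eps)
    # Project back onto the unit sphere, where cosine similarity is compared.
    return perturbed / np.linalg.norm(perturbed)

# A random unit vector stands in for a real CLIP image embedding here.
rng = np.random.default_rng(0)
image_feature = rng.standard_normal(512)
image_feature /= np.linalg.norm(image_feature)

text_like = pseudo_text_feature(image_feature, noise_level=0.1, rng=rng)
cosine = float(text_like @ image_feature)
print(f"cosine similarity to the image feature: {cosine:.4f}")
```

In a training loop, a feature like `text_like` would be fed to the generator wherever a real caption embedding would otherwise be required. A larger `noise_level` moves the pseudo feature further from the image feature, which trades fidelity for diversity in the conditioning signal.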

updated: Sat Nov 27 2021 01:54:45 GMT+0000 (UTC)

published: Sat Nov 27 2021 01:54:45 GMT+0000 (UTC)

arXiv
