GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol; Prafulla Dhariwal; Aditya Ramesh; Pranav Shyam; Pamela Mishkin; Bob McGrew; Ilya Sutskever; Mark Chen

Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators over those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release the code and weights at https://github.com/openai/glide-text2im.
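
The abstract names classifier-free guidance without spelling out how it trades diversity for fidelity. As a minimal sketch, assuming the commonly used form of the rule (the abstract itself does not state it), the model's text-conditional noise prediction is extrapolated away from its unconditional prediction by a guidance scale s: eps_hat = eps_uncond + s * (eps_cond - eps_uncond). The function and parameter names below are hypothetical and are not taken from the released glide-text2im code; plain Python lists stand in for the image-shaped tensors a real sampler would use.

```python
from typing import List, Sequence


def classifier_free_guidance(
    eps_cond: Sequence[float],
    eps_uncond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Combine conditional and unconditional noise predictions.

    Computes eps_hat = eps_uncond + s * (eps_cond - eps_uncond) elementwise.
    All names here are illustrative assumptions, not the paper's API.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(eps_cond, eps_uncond)
    ]


if __name__ == "__main__":
    cond = [1.0, 2.0]    # hypothetical prediction given the caption
    uncond = [0.0, 1.0]  # hypothetical prediction given an empty caption
    for s in (0.0, 1.0, 3.0):
        print(s, classifier_free_guidance(cond, uncond, s))
```

Under this form, a scale of 1 recovers the plain conditional prediction and a scale of 0 ignores the caption entirely. Scales above 1 push samples further toward the caption, which is the diversity-for-fidelity trade-off the abstract refers to.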

updated: Wed Dec 22 2021 18:39:39 GMT+0000 (UTC)

published: Mon Dec 20 2021 18:42:55 GMT+0000 (UTC)

arXiv
