All are Worth Words: A ViT Backbone for Diffusion Models

Fan Bao; Shen Nie; Kaiwen Xue; Yue Cao; Chongxuan Li; Hang Su; Jun Zhu

Vision transformers (ViT) have shown promise in various vision tasks, while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs, including the time, condition and noisy image patches, as tokens and by employing long skip connections between shallow and deep layers. We evaluate U-ViT in unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable if not superior to a CNN-based U-Net of a similar size. In particular, latent diffusion models with U-ViT achieve record-breaking FID scores of 2.29 in class-conditional image generation on ImageNet 256x256, and 5.48 in text-to-image generation on MS-COCO, among methods that do not access large external datasets during the training of generative models. Our results suggest that, for diffusion-based image modeling, the long skip connection is crucial while the down-sampling and up-sampling operators in CNN-based U-Net are not always necessary. We believe that U-ViT can provide insights for future research on backbones in diffusion models and benefit generative modeling on large-scale cross-modality datasets.
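The two ideas named in the abstract, treating every input as a token and adding long skip connections between shallow and deep layers, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the toy block (a stand-in for a real transformer block, with no attention), the random weights, and the choice to merge each skip by concatenation followed by a linear projection are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * c)

def build_tokens(image, time_emb, cond_emb, w_patch, patch):
    """Stack the time token, condition token and patch tokens in one sequence."""
    patch_tokens = patchify(image, patch) @ w_patch  # (N, D)
    return np.vstack([time_emb[None, :], cond_emb[None, :], patch_tokens])

def toy_block(tokens, w):
    """Hypothetical placeholder for a transformer block (no attention here)."""
    return tokens + np.tanh(tokens @ w)

def forward_with_long_skips(tokens, block_ws, skip_ws):
    """Run 2k+1 blocks; outputs of the first k blocks feed the last k blocks."""
    k = len(skip_ws)
    assert len(block_ws) == 2 * k + 1
    saved = []
    h = tokens
    for w in block_ws[:k]:            # shallow half: remember each output
        h = toy_block(h, w)
        saved.append(h)
    h = toy_block(h, block_ws[k])     # middle block
    for w, w_skip in zip(block_ws[k + 1:], skip_ws):  # deep half
        skip = saved.pop()            # shallowest output pairs with deepest block
        # Assumed merge rule: concatenate on the feature axis, then project back to D.
        h = np.concatenate([h, skip], axis=-1) @ w_skip
        h = toy_block(h, w)
    return h

# Usage with made-up sizes: an 8x8 RGB "noisy image", 4x4 patches, width D = 16.
D, patch, k = 16, 4, 2
image = rng.normal(size=(8, 8, 3))
w_patch = rng.normal(size=(patch * patch * 3, D)) * 0.1
tokens = build_tokens(image, rng.normal(size=D), rng.normal(size=D), w_patch, patch)
block_ws = [rng.normal(size=(D, D)) * 0.1 for _ in range(2 * k + 1)]
skip_ws = [rng.normal(size=(2 * D, D)) * 0.1 for _ in range(k)]
out = forward_with_long_skips(tokens, block_ws, skip_ws)
print(tokens.shape, out.shape)
```

The sequence length stays fixed from input to output (here 2 conditioning tokens plus 4 patch tokens), which reflects the abstract's point that down-sampling and up-sampling operators are not always necessary, while the saved shallow outputs play the role of the long skip connections.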

updated: Sat Mar 25 2023 13:01:42 GMT+0000 (UTC)

published: Sun Sep 25 2022 05:21:59 GMT+0000 (UTC)

arXiv
