Unleashing Text-to-Image Diffusion Models for Visual Perception

Wenliang Zhao; Yongming Rao; Zuyan Liu; Benlin Liu; Jie Zhou; Jiwen Lu

Diffusion models (DMs) have become the new trend in generative modeling and have demonstrated a strong capacity for conditional synthesis. Among them, text-to-image diffusion models pre-trained on large-scale image-text pairs are highly controllable through customizable prompts. Unlike unconditional generative models, which focus on low-level attributes and details, text-to-image diffusion models contain more high-level knowledge thanks to vision-language pre-training. In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks. Instead of using the pre-trained denoising autoencoder in a diffusion-based pipeline, we simply use it as a backbone and study how to take full advantage of the learned knowledge. Specifically, we prompt the denoising decoder with proper textual inputs and refine the text features with an adapter, which improves alignment with the pre-training stage and lets the visual content interact with the text prompts. We also propose to utilize the cross-attention maps between the visual features and the text features to provide explicit guidance. Compared with other pre-training methods, we show that vision-language pre-trained diffusion models can be adapted faster to downstream visual perception tasks using the proposed VPD. Extensive experiments on semantic segmentation, referring image segmentation, and depth estimation demonstrate the effectiveness of our method. Notably, VPD attains 0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on RefCOCO-val referring image segmentation, establishing new records on these two benchmarks. Code is available at https://github.com/wl-zhao/VPD.
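To make the notion of a cross-attention map between visual and text features more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `cross_attention_map`, the toy feature vectors, and the omission of the learned query/key projections and multiple heads are all assumptions made for illustration. It only shows the generic scaled dot-product form, in which each visual location receives a probability distribution over text tokens.

```python
import math

def cross_attention_map(visual, text):
    """Scaled dot-product attention weights from visual locations to text tokens.

    visual: list of N feature vectors (one per spatial location), used as queries.
    text:   list of T feature vectors (one per text token), used as keys.
    Returns an N x T list of lists; each row sums to 1.

    Assumption: learned projections and multi-head splitting are omitted,
    so this is a simplified illustration, not the VPD code.
    """
    dim = len(text[0])
    scale = 1.0 / math.sqrt(dim)
    attn = []
    for q in visual:
        # Similarity of this visual location to every text token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text]
        # Numerically stable softmax over the text tokens.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        attn.append([e / total for e in exps])
    return attn

# Hypothetical toy inputs: three visual locations, two text tokens.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[4.0, 0.0], [0.0, 4.0]]
attn = cross_attention_map(visual, text)
```

Reading one column of `attn` across all visual locations gives a spatial map showing where a given text token is attended to. That is the kind of signal the abstract describes as explicit guidance.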

updated: Fri Mar 03 2023 18:59:47 GMT+0000 (UTC)

published: Fri Mar 03 2023 18:59:47 GMT+0000 (UTC)

arXiv
