DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models

Weijia Wu; Yuzhong Zhao; Mike Zheng Shou; Hong Zhou; Chunhua Shen

Collecting and annotating images with pixel-wise labels is time-consuming and laborious. In contrast, synthetic data can be generated freely with a generative model (e.g., DALL-E, Stable Diffusion). In this paper, we show that it is possible to automatically obtain accurate semantic masks for synthetic images generated by the off-the-shelf Stable Diffusion model, which uses only text-image pairs during training. Our approach, called DiffuMask, exploits the potential of the cross-attention map between text and image, which offers a natural and seamless way to extend text-driven image synthesis to semantic mask generation. DiffuMask uses text-guided cross-attention information to localize class/word-specific regions, which are combined with practical techniques to create a novel high-resolution and class-discriminative pixel-wise mask. These methods help to substantially reduce data collection and annotation costs. Experiments demonstrate that existing segmentation methods trained on DiffuMask's synthetic data can achieve performance competitive with their counterparts trained on real data (VOC 2012, Cityscapes). For some classes (e.g., bird), DiffuMask shows promising performance, close to the state-of-the-art result obtained with real data (within a 3% mIoU gap). Moreover, in the open-vocabulary segmentation (zero-shot) setting, DiffuMask achieves a new SOTA result on the unseen classes of VOC 2012. The project website can be found at https://weijiawu.github.io/DiffusionMask/.
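
To make the idea of localizing a class-specific region from cross-attention concrete, the following is a minimal sketch and not the authors' implementation. It assumes the cross-attention maps for one text token have already been extracted from the diffusion model as 2D arrays at several resolutions. The function name `attention_to_mask`, the nearest-neighbour upsampling, and the fixed `threshold` are illustrative assumptions; the abstract does not specify the "practical techniques" used to refine the mask.

```python
import numpy as np

def attention_to_mask(attn_maps, out_size, threshold=0.5):
    """Turn per-token cross-attention maps into a coarse binary mask.

    attn_maps: list of 2D arrays, one per attention layer/resolution
               (hypothetical input; extraction from the model is not shown).
    out_size:  (height, width) of the output mask; assumed to be an
               integer multiple of every map's shape.
    threshold: assumed fixed cut-off applied after min-max normalisation.
    """
    upsampled = []
    for a in attn_maps:
        a = np.asarray(a, dtype=np.float64)
        ry = out_size[0] // a.shape[0]
        rx = out_size[1] // a.shape[1]
        # Nearest-neighbour upsampling: repeat each cell ry x rx times.
        upsampled.append(np.kron(a, np.ones((ry, rx))))

    # Average the maps from all resolutions into one response map.
    avg = np.mean(upsampled, axis=0)

    # Min-max normalise to [0, 1]; a flat map yields an empty mask.
    lo, hi = avg.min(), avg.max()
    if hi > lo:
        norm = (avg - lo) / (hi - lo)
    else:
        norm = np.zeros_like(avg)

    return norm >= threshold
```

A fixed threshold is the simplest possible choice here and would likely give noisy boundaries on real attention maps, which is presumably why the paper pairs the attention signal with additional refinement steps.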

updated: Fri Aug 11 2023 09:44:04 GMT+0000 (UTC)

published: Tue Mar 21 2023 08:43:15 GMT+0000 (UTC)

arXiv
