Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors

Ryan Burgert; Kanchana Ranasinghe; Xiang Li; Michael S. Ryoo

Recent diffusion-based generative models, combined with vision-language models, can create realistic images from natural language prompts. Although these models are trained on large internet-scale datasets, they are never directly exposed to any semantic localization or grounding. Most current approaches to localization or grounding rely on human-annotated localization information in the form of bounding boxes or segmentation masks. The exceptions are a few unsupervised methods that use architectures or loss functions geared towards localization, but these need to be trained separately. In this work, we explore how off-the-shelf diffusion models, trained with no exposure to such localization information, can ground various semantic phrases with no segmentation-specific re-training. We introduce an inference-time optimization process that generates segmentation masks conditioned on natural language. We evaluate our proposal, Peekaboo, for unsupervised semantic segmentation on the Pascal VOC dataset and for referring segmentation on the RefCOCO dataset. In summary, we present a first zero-shot, open-vocabulary, unsupervised (no localization information) semantic grounding technique that leverages diffusion-based generative models with no re-training. Our code will be released publicly.

updated: Wed Nov 23 2022 18:59:05 GMT+0000 (UTC)

published: Wed Nov 23 2022 18:59:05 GMT+0000 (UTC)

arXiv
