Diffusion Models for Zero-Shot Open-Vocabulary Segmentation

Laurynas Karazija; Iro Laina; Andrea Vedaldi; Christian Rupprecht

The variety of objects in the real world is nearly unlimited and is thus impossible to capture using models trained on a fixed set of categories. As a result, in recent years, open-vocabulary methods have attracted the interest of the community. This paper proposes a new method for zero-shot open-vocabulary segmentation. Prior work largely relies on contrastive training using image-text pairs, leveraging grouping mechanisms to learn image features that are both aligned with language and well-localised. This, however, can introduce ambiguity, as the visual appearance of images with similar captions often varies. Instead, we leverage the generative properties of large-scale text-to-image diffusion models to sample a set of support images for a given textual category. This provides a distribution of appearances for a given text, circumventing the ambiguity problem. We further propose a mechanism that considers the contextual background of the sampled images to better localise objects and segment the background directly. We show that our method can be used to ground several existing pre-trained self-supervised feature extractors in natural language and provide explainable predictions by mapping back to regions in the support set. Our proposal is training-free, relying on pre-trained components only, yet shows strong performance on a range of open-vocabulary segmentation benchmarks, obtaining a lead of more than 10% on the Pascal VOC benchmark.
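The idea of classifying image regions against a support set can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how support features are aggregated or compared, so the sketch assumes that each category (and the background) has already been summarised as one or more prototype feature vectors, and that patches are assigned by cosine similarity. The names `segment_by_prototypes`, `patch_features`, `prototypes`, and `prototype_labels` are hypothetical, and the toy vectors stand in for the outputs of a real feature extractor.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def segment_by_prototypes(patch_features, prototypes, prototype_labels):
    """Assign each image patch the label of its most similar prototype.

    patch_features:   (H, W, D) features of the image to segment (assumed given).
    prototypes:       (K, D) feature vectors summarising support-image regions
                      (assumed given; both foreground and background prototypes).
    prototype_labels: (K,) integer category id for each prototype.

    Returns the (H, W) label map and the (H, W) index of the matched prototype.
    """
    h, w, d = patch_features.shape
    feats = l2_normalize(patch_features.reshape(-1, d))
    protos = l2_normalize(prototypes)
    similarity = feats @ protos.T          # (H*W, K) cosine similarities
    nearest = similarity.argmax(axis=1)    # best-matching prototype per patch
    labels = np.asarray(prototype_labels)[nearest]
    return labels.reshape(h, w), nearest.reshape(h, w)


# Toy example: three prototypes in a 4-dimensional feature space.
prototypes = np.array([
    [1.0, 0.0, 0.0, 0.0],   # hypothetical "cat" prototype
    [0.0, 1.0, 0.0, 0.0],   # hypothetical background prototype
    [0.0, 0.0, 1.0, 0.0],   # hypothetical "dog" prototype
])
prototype_labels = np.array([1, 0, 2])  # 0 = background, 1 = cat, 2 = dog

patch_features = np.array([
    [[2.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.5, 0.0], [1.0, 0.1, 0.0, 0.0]],
])

label_map, matched = segment_by_prototypes(patch_features, prototypes, prototype_labels)
print(label_map)
print(matched)
```

Returning the index of the matched prototype alongside the label is one way a prediction could be traced back to a region of a support image, in the spirit of the explainability claim in the abstract. How the paper actually does this is not stated here.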

updated: Thu Jun 15 2023 17:51:28 GMT+0000 (UTC)

published: Thu Jun 15 2023 17:51:28 GMT+0000 (UTC)

arXiv
