ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation

Kehan Li; Zhennan Wang; Zesen Cheng; Runyi Yu; Yian Zhao; Guoli Song; Chang Liu; Li Yuan; Jie Chen

Recently, self-supervised large-scale visual pre-training models have shown great promise in representing pixel-level semantic relationships, significantly promoting the development of unsupervised dense prediction tasks such as unsupervised semantic segmentation (USS). The relationships extracted among pixel-level representations typically carry rich class-aware information: semantically identical pixel embeddings gather together in the representation space to form sophisticated concepts. However, leveraging the learned models to identify semantically consistent pixel groups or regions in an image is non-trivial, since over- or under-clustering overwhelms the conceptualization procedure under the varying semantic distributions of different images. In this work, we investigate pixel-level semantic aggregation in self-supervised ViT pre-trained models as image segmentation and propose the Adaptive Conceptualization approach for USS, termed ACSeg. Concretely, we explicitly encode concepts into learnable prototypes and design the Adaptive Concept Generator (ACG), which adaptively maps these prototypes to informative concepts for each image. Meanwhile, considering the differing scene complexity of images, we propose the modularity loss, which optimizes the ACG independently of the number of concepts by estimating the intensity with which pixel pairs belong to the same concept. Finally, we turn the USS task into classifying the discovered concepts in an unsupervised manner. Extensive experiments with state-of-the-art results demonstrate the effectiveness of the proposed ACSeg.
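
The abstract does not give the form of the modularity loss, so the following is only a rough illustration of the underlying idea, assuming the classic Newman modularity on a weighted pixel-affinity graph rather than the authors' exact formulation. The function name `modularity`, the toy affinity matrix, and the assignment matrices are all hypothetical. The sketch shows why such an objective does not depend on the number of concept slots: adding an unused concept leaves the score unchanged.

```python
import numpy as np

def modularity(affinity, assignment):
    """Classic Newman modularity Q for a weighted undirected graph.

    affinity:   (N, N) symmetric non-negative matrix of pairwise weights
                (here standing in for pixel-to-pixel affinities).
    assignment: (N, K) matrix whose rows give each node's (soft or hard)
                membership over K concepts.
    NOTE: this is the textbook definition, used as an assumption; it is
    not claimed to be the exact loss used by ACSeg.
    """
    W = np.asarray(affinity, dtype=float)
    S = np.asarray(assignment, dtype=float)
    two_m = W.sum()                      # total edge weight, counted twice
    k = W.sum(axis=1)                    # weighted degree of each node
    B = W - np.outer(k, k) / two_m       # observed minus expected affinity
    return float(np.trace(S.T @ B @ S) / two_m)

# Toy graph: four "pixels" forming two disconnected pairs (0-1 and 2-3).
W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

two_concepts = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
one_concept = np.ones((4, 1))
# Same grouping as two_concepts, plus a third concept slot that no pixel uses.
extra_unused = np.hstack([two_concepts, np.zeros((4, 1))])

q_two = modularity(W, two_concepts)
q_one = modularity(W, one_concept)
q_extra = modularity(W, extra_unused)
```

A training loss in this spirit would minimize `-modularity(W, S)` with `S` produced by the model; how ACSeg builds the affinities from ViT features is not specified in the abstract and is left out here.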

updated: Sat Nov 19 2022 05:29:10 GMT+0000 (UTC)

published: Wed Oct 12 2022 06:16:34 GMT+0000 (UTC)

arXiv

