TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation

Zhaoyuan Yin; Pichao Wang; Fan Wang; Xianzhe Xu; Hanling Zhang; Hao Li; Rong Jin

Unsupervised semantic segmentation aims to obtain high-level semantic representations from low-level visual features without manual annotations. Most existing methods are bottom-up approaches that group pixels into regions based on visual cues or predefined rules. As a result, these approaches struggle to produce fine-grained semantic segmentation in complicated scenes that contain multiple objects, some of which share a similar visual appearance. In contrast, we propose the first top-down unsupervised semantic segmentation framework for fine-grained segmentation in extremely complicated scenarios. Specifically, we first obtain rich, high-level, structured semantic concept information from large-scale vision data in a self-supervised manner, and use this information as a prior to discover the potential semantic categories present in the target dataset. Second, the discovered high-level semantic categories are mapped to low-level pixel features by computing a class activation map (CAM) with respect to a given discovered semantic representation. Finally, the resulting CAMs serve as pseudo labels to train the segmentation module and produce the final semantic segmentation. Experimental results on multiple semantic segmentation benchmarks show that our top-down unsupervised segmentation is robust on both object-centric and scene-centric datasets at different levels of semantic granularity, and that it outperforms all current state-of-the-art bottom-up methods. Our code is available at https://github.com/damo-cv/TransFGU.
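The three stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: random vectors stand in for self-supervised patch features, plain k-means stands in for category discovery, and a cosine-similarity map stands in for the class activation map. The function names (`kmeans`, `pseudo_labels`) and all sizes are assumptions chosen for illustration only.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain k-means on the rows of x; returns (k, d) centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def pseudo_labels(patch_feats, prototypes):
    """Map (H, W, d) patch features to an (H, W) label map.

    The cosine similarity to each prototype is a simplified stand-in
    for a per-category activation map; argmax picks the pseudo label.
    """
    sim = l2_normalize(patch_feats) @ l2_normalize(prototypes).T  # (H, W, k)
    return sim.argmax(axis=-1)


# Toy data (assumption): features scattered around k hidden category centers.
rng = np.random.default_rng(0)
num_images, H, W, d, k = 4, 6, 6, 8, 3
centers = rng.normal(size=(k, d)) * 5.0
true = rng.integers(0, k, size=(num_images, H, W))
feats = centers[true] + rng.normal(size=(num_images, H, W, d))

# Stage 1: discover category prototypes from the whole (toy) dataset.
prototypes = kmeans(feats.reshape(-1, d), k)
# Stage 2: map prototypes back to pixels and derive pseudo labels.
labels = pseudo_labels(feats[0], prototypes)
# Stage 3 (not shown): train a segmentation module on `labels`.
```

In a real pipeline the features would come from a pretrained self-supervised backbone, and the pseudo labels would supervise a separate segmentation head rather than being used directly.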

updated: Fri Jul 22 2022 23:01:32 GMT+0000 (UTC)

published: Thu Dec 02 2021 18:59:03 GMT+0000 (UTC)

arXiv
