Semantic-Enhanced Image Clustering

Shaotian Cai; Liping Qiu; Xiaojun Chen; Qin Zhang; Longteng Chen

Image clustering is an important and challenging open task in computer vision. Although many methods have been proposed for it, they explore only the images themselves and uncover clusters from image features alone, so they cannot distinguish visually similar but semantically different images. In this paper, we propose to investigate image clustering with the help of a visual-language pre-training model. Unlike the zero-shot setting, in which the class names are known, only the number of clusters is known in this setting. The two key problems are therefore how to map images to a proper semantic space and how to cluster images using both the image space and the semantic space. To solve these problems, we propose a novel image clustering method guided by the visual-language pre-training model CLIP, named Semantic-Enhanced Image Clustering (SIC). In this method, we first propose a way to map the given images to a proper semantic space, together with efficient methods to generate pseudo-labels according to the relationships between images and semantics. Finally, we propose performing clustering with consistency learning in both image space and semantic space, in a self-supervised fashion. Our convergence analysis shows that the proposed method converges at a sublinear rate. Our theoretical analysis of the expected risk further shows that the risk can be reduced by improving neighborhood consistency, increasing prediction confidence, or reducing neighborhood imbalance. Experimental results on five benchmark datasets clearly show the superiority of our new method.
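The abstract does not spell out how pseudo-labels are derived from image-semantic relationships, so the following is only a minimal sketch of one plausible reading, not the SIC procedure itself. It assumes each image and each cluster already has an embedding in a shared space (for example, from CLIP's image and text encoders), scores every image against hypothetical "semantic centers" by cosine similarity, and takes the most probable center as the pseudo-label. The names `pseudo_labels` and `semantic_centers`, the toy vectors, and the temperature value are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=0.1):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_labels(image_embeddings, semantic_centers, temperature=0.1):
    """Assign each image the index of its most similar semantic center.

    Returns a list of (label, confidence) pairs. This is an illustrative
    assumption about how image-semantic similarity could yield
    pseudo-labels; the paper's actual rule may differ.
    """
    labels = []
    for emb in image_embeddings:
        sims = [cosine(emb, center) for center in semantic_centers]
        probs = softmax(sims, temperature)
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append((best, probs[best]))
    return labels


# Toy 2-D embeddings standing in for real encoder outputs (hypothetical data).
images = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
centers = [[1.0, 0.0], [0.0, 1.0]]  # one center per cluster; only the count is known
result = pseudo_labels(images, centers)
```

In a real pipeline the confidence values could be thresholded so that only confident pseudo-labels supervise the clustering head, which is one way to read the abstract's remark that increasing prediction confidence reduces the expected risk.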

updated: Sun Apr 09 2023 02:33:10 GMT+0000 (UTC)

published: Sun Aug 21 2022 09:04:21 GMT+0000 (UTC)

arXiv
