Toward Multi-Diversified Ensemble Clustering of High-Dimensional Data: From Subspaces to Metrics and Beyond

Dong Huang; Chang-Dong Wang; Jian-Huang Lai; Chee-Keong Kwoh

The rapid emergence of high-dimensional data in many areas poses new challenges for ensemble clustering research. To deal with the curse of dimensionality, considerable recent effort in ensemble clustering has gone into subspace-based techniques. Beyond subspaces, however, little attention has been paid to the potential diversity in similarity/dissimilarity metrics. Two questions remain open in ensemble clustering: how to create and aggregate a large population of diversified metrics, and how to jointly investigate the multi-level diversity in large populations of metrics, subspaces, and clusters within a unified framework. To tackle this problem, this paper proposes a novel multi-diversified ensemble clustering approach. In particular, we create a large number of diversified metrics by randomizing a scaled exponential similarity kernel, and couple these metrics with random subspaces to form a large set of metric-subspace pairs. The similarity matrices derived from these metric-subspace pairs are then used to construct an ensemble of diversified base clusterings. Further, an entropy-based criterion is used to explore the cluster-wise diversity in ensembles, and on this basis three specific ensemble clustering algorithms are presented by incorporating three types of consensus functions. Extensive experiments on 30 high-dimensional datasets, including 18 cancer gene expression datasets and 12 image/speech datasets, demonstrate the superiority of our algorithms over the state of the art. The source code is available at https://github.com/huangdonghere/MDEC.
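To make the idea of metric-subspace pairs concrete, the following is a minimal Python sketch, not the authors' implementation (see the linked repository for that). The abstract does not give the kernel formula, so the sketch assumes a commonly used form of the scaled exponential similarity kernel, exp(-d(i,j)^2 / (mu * eps(i,j))), where eps(i,j) averages the two points' mean k-nearest-neighbor distances and d(i,j). The function names, the parameter ranges for `mu` and `k`, and the subspace sampling ratio are all hypothetical choices made for illustration.

```python
import numpy as np


def ses_similarity(X, mu, k):
    """Scaled exponential similarity between the rows of X (assumed form)."""
    # Pairwise Euclidean distances, clipped at zero to absorb rounding error.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    np.fill_diagonal(d2, 0.0)
    d = np.sqrt(d2)
    # Mean distance from each point to its k nearest neighbours
    # (column 0 of the sorted rows is the point itself, so it is skipped).
    knn_mean = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)
    # Local scaling term eps(i, j); the floor avoids division by zero.
    eps = np.maximum((knn_mean[:, None] + knn_mean[None, :] + d) / 3.0, 1e-12)
    return np.exp(-d2 / (mu * eps))


def metric_subspace_pairs(X, n_pairs, subspace_ratio=0.5,
                          mu_range=(0.3, 0.8), k_range=(5, 20), seed=0):
    """Build one similarity matrix per random (metric, subspace) pair.

    All default values here are illustrative assumptions, not values
    taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    n_feat = max(1, int(round(subspace_ratio * dim)))
    k_lo = min(k_range[0], n - 1)
    k_hi = min(k_range[1], n - 1)
    sims = []
    for _ in range(n_pairs):
        # Random subspace: a random subset of the features.
        feats = rng.choice(dim, size=n_feat, replace=False)
        # Randomized metric: random kernel parameters.
        mu = rng.uniform(*mu_range)
        k = int(rng.integers(k_lo, k_hi + 1))
        sims.append(ses_similarity(X[:, feats], mu, k))
    return sims
```

Each returned matrix could then be passed to any similarity-based clustering method to obtain one base clustering, so that the ensemble inherits diversity from both the sampled features and the sampled kernel parameters.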

updated: Tue Jan 05 2021 01:49:43 GMT+0000 (UTC)

published: Mon Oct 09 2017 14:19:04 GMT+0000 (UTC)

arXiv
