UCSL : A Machine Learning Expectation-Maximization framework for Unsupervised Clustering driven by Supervised Learning

Robin Louiset; Pietro Gori; Benoit Dufumier; Josselin Houenou; Antoine Grigis; Edouard Duchesnay

Subtype discovery consists of finding interpretable and consistent sub-parts of a dataset that are also relevant to a given supervised task. From a mathematical point of view, this can be defined as a clustering task driven by supervised learning, with the aim of uncovering subgroups in line with the supervised prediction. In this paper, we propose a general Expectation-Maximization ensemble framework entitled UCSL (Unsupervised Clustering driven by Supervised Learning). Our method is generic: it can integrate any clustering method and can be driven by either binary classification or regression. We propose to construct a non-linear model by merging multiple linear estimators, one per cluster. Each hyperplane is estimated so that it correctly discriminates (or predicts) only one cluster. We use SVC or Logistic Regression for classification and SVR for regression. Furthermore, to perform cluster analysis within a more suitable space, we also propose a dimension-reduction algorithm that projects the data onto an orthonormal space relevant to the supervised task. We analyze the robustness and generalization capability of our algorithm using synthetic and experimental datasets. In particular, we validate its ability to identify suitable, consistent subtypes by conducting a cluster analysis of psychiatric diseases with known ground-truth labels. The gain of the proposed method over previous state-of-the-art techniques is about +1.9 points in terms of balanced accuracy. Finally, we make code and examples available in a scikit-learn-compatible Python package at https://github.com/neurospin-projects/2021_rlouiset_ucsl
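To make the idea in the abstract more concrete, the following is a minimal sketch of the kind of alternation it describes, for the binary-classification case. It is not the authors' implementation and does not use the API of their package: the function name `hard_em_subtype_sketch`, the toy data, the use of k-means as the clustering method, and the hard (0/1) cluster assignments are all assumptions made for illustration, and the published algorithm may differ, for instance in how cluster memberships are weighted. The sketch fits one logistic-regression hyperplane per cluster (negatives against that cluster only), builds an orthonormal basis from the hyperplane normals, and re-clusters the positive samples in that subspace.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression


def hard_em_subtype_sketch(X, y, n_clusters=2, n_iter=5, seed=0):
    """Hypothetical hard-assignment illustration, not the UCSL package API."""
    pos = np.flatnonzero(y == 1)  # samples whose subtypes we look for
    neg = np.flatnonzero(y == 0)  # reference class
    # Initialisation: plain k-means on the positive class in the input space.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X[pos])
    for _ in range(n_iter):
        # Supervised step: one linear classifier per cluster,
        # trained on the negatives versus that cluster only.
        normals = []
        for k in range(n_clusters):
            idx = np.concatenate([neg, pos[labels == k]])
            clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
            normals.append(clf.coef_.ravel())
        # Orthonormal basis of the span of the hyperplane normals (reduced QR).
        Q, _ = np.linalg.qr(np.array(normals).T)
        # Clustering step: re-cluster the positives in the supervised subspace.
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(X[pos] @ Q)
    return labels, Q


# Toy data (assumed for illustration): one negative blob at the origin and
# two positive subtypes shifted along different axes; the third feature is noise.
rng = np.random.default_rng(0)
X_neg = rng.normal(0.0, 1.0, size=(100, 3))
X_a = rng.normal(0.0, 1.0, size=(50, 3)) + np.array([4.0, 0.0, 0.0])
X_b = rng.normal(0.0, 1.0, size=(50, 3)) + np.array([0.0, 4.0, 0.0])
X = np.vstack([X_neg, X_a, X_b])
y = np.array([0] * 100 + [1] * 100)

labels, Q = hard_em_subtype_sketch(X, y)
```

Here `labels` holds one cluster index per positive sample and `Q` is the orthonormal basis used for the final clustering step. Cluster indices are arbitrary, so any comparison with known subtypes has to allow for label permutation.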

updated: Mon Jul 05 2021 12:55:13 GMT+0000 (UTC)

published: Mon Jul 05 2021 12:55:13 GMT+0000 (UTC)

arXiv
