More Than Meets The Eye: Semi-supervised Learning Under Non-IID Data

Saul Calderon-Ramirez; Luis Oala

A common heuristic in semi-supervised deep learning (SSDL) is to select unlabelled data based on a notion of semantic similarity to the labelled data. For example, labelled images of numbers should be paired with unlabelled images of numbers rather than, say, unlabelled images of cars. We refer to this practice as semantic data set matching. In this work, we demonstrate the limits of semantic data set matching and show that it can sometimes even degrade the performance of a state-of-the-art SSDL algorithm. We present and make available a comprehensive simulation sandbox, called non-IID-SSDL, for stress testing an SSDL algorithm under different degrees of distribution mismatch between the labelled and unlabelled data sets. In addition, we demonstrate that simple density-based dissimilarity measures in the feature space of a generic classifier offer a promising and more reliable quantitative matching criterion for selecting unlabelled data before SSDL training.
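To make the idea of a density-based dissimilarity measure in feature space more concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the measure, so the per-dimension histograms and the Jensen-Shannon divergence used here are assumptions chosen for illustration. The helper names (`histogram`, `js_divergence`, `feature_dissimilarity`, `fake_features`) are hypothetical, and random Gaussian vectors stand in for the features a generic classifier would produce. Only the Python standard library is used, to keep the sketch dependency-free.

```python
import math
import random


def histogram(values, lo, hi, bins):
    """Normalised histogram of `values` over [lo, hi] with `bins` equal bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clip v == hi into last bin
        counts[idx] += 1
    return [c / len(values) for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two discrete densities."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 whenever a > 0 here.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [0.5 * (x + y) for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def feature_dissimilarity(feats_a, feats_b, bins=10):
    """Mean per-dimension JS divergence between two sets of feature vectors.

    Assumption: each feature dimension is treated independently, which is a
    simplification made for this sketch.
    """
    n_dims = len(feats_a[0])
    total = 0.0
    for d in range(n_dims):
        col_a = [row[d] for row in feats_a]
        col_b = [row[d] for row in feats_b]
        lo = min(min(col_a), min(col_b))
        hi = max(max(col_a), max(col_b))
        if hi == lo:
            continue  # constant dimension contributes zero divergence
        total += js_divergence(histogram(col_a, lo, hi, bins),
                               histogram(col_b, lo, hi, bins))
    return total / n_dims


# Stand-in features: in practice these would come from a classifier's
# penultimate layer applied to the labelled and candidate unlabelled sets.
rng = random.Random(0)


def fake_features(mean, n=500, dims=8):
    return [[rng.gauss(mean, 1.0) for _ in range(dims)] for _ in range(n)]


labelled = fake_features(0.0)
candidate_close = fake_features(0.0)  # similar feature distribution
candidate_far = fake_features(3.0)    # shifted feature distribution

d_close = feature_dissimilarity(labelled, candidate_close)
d_far = feature_dissimilarity(labelled, candidate_far)
print(f"close candidate: {d_close:.4f}, far candidate: {d_far:.4f}")
```

Under this sketch, the candidate unlabelled set with the smaller score would be preferred for SSDL training. Whether such a score tracks downstream accuracy is the empirical question the paper examines; the sketch itself makes no claim about that.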

updated: Tue Apr 20 2021 19:51:10 GMT+0000 (UTC)

published: Tue Apr 20 2021 19:51:10 GMT+0000 (UTC)

arXiv
