Identifying Mislabeled Images in Supervised Learning Utilizing Autoencoder

Yunhao Yang; Andrew Whinston

Supervised learning rests on the assumption that the ground truth in the training data is accurate. In real-world settings, however, this is not guaranteed, and inaccurate training data can lead to unexpected predictions. In image classification, incorrect labels can make the classification model inaccurate as well. In this paper, we apply unsupervised techniques to the training data before training the classification network. A convolutional autoencoder is used to encode and reconstruct the images. The encoder projects the image data onto a latent space, in which image features are preserved in a lower dimension. The underlying assumption is that data samples with similar features are likely to share the same label. Noisy samples can then be identified in the latent space with the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, because incorrectly labeled data appear as outliers in the latent space. All outliers identified by DBSCAN are therefore treated as mislabeled samples and removed from the dataset, after which the remaining training data can be used directly to train the supervised learning network. The algorithm detects and removes more than 67% of the mislabeled data in the experimental dataset.
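The outlier-removal step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the latent vectors have already been produced by a trained encoder, it assumes clustering is run separately within each labeled class (the abstract does not say how labels and clusters are combined), and the names `dbscan_noise`, `flag_suspected_mislabels`, `eps`, and `min_samples`, as well as the toy data, are hypothetical. The noise rule follows the standard DBSCAN definition, written in pure Python to stay self-contained.

```python
import math
from collections import defaultdict


def dbscan_noise(points, eps, min_samples):
    """Return indices of the points that DBSCAN would label as noise.

    A point is a core point if at least `min_samples` points (itself
    included) lie within distance `eps`. A point is noise if it is not a
    core point and has no core point within `eps`.
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    return [
        i
        for i in range(n)
        if not is_core[i] and not any(is_core[j] for j in neighbors[i])
    ]


def flag_suspected_mislabels(latents, labels, eps, min_samples):
    """Flag samples that are latent-space outliers within their own class.

    Assumption (not stated in the abstract): each labeled class is
    clustered on its own, so a sample far from the rest of its class is
    treated as a suspected mislabel.
    """
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    flagged = []
    for idxs in by_label.values():
        class_points = [latents[i] for i in idxs]
        flagged.extend(idxs[k] for k in dbscan_noise(class_points, eps, min_samples))
    return sorted(flagged)


# Toy 2-D "latent vectors" (hypothetical): two tight groups, plus one
# sample labeled "cat" that actually sits inside the "dog" group.
latents = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),   # cats
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),   # dogs
    (5.05, 5.05),                                      # labeled "cat"
]
labels = ["cat"] * 4 + ["dog"] * 4 + ["cat"]

flagged = flag_suspected_mislabels(latents, labels, eps=0.5, min_samples=3)

# Drop the flagged samples before training the supervised classifier.
cleaned = [
    (z, y) for i, (z, y) in enumerate(zip(latents, labels)) if i not in flagged
]
print(flagged)  # [8]
```

In practice a library implementation would likely replace `dbscan_noise`; scikit-learn's `DBSCAN`, for example, marks noise points with the cluster label `-1`. The values of `eps` and `min_samples` depend on the scale and density of the latent space and would need tuning for a real dataset.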

updated: Mon Jan 18 2021 22:59:44 GMT+0000 (UTC)

published: Sat Nov 07 2020 03:09:34 GMT+0000 (UTC)

arXiv
