Improving Semi-supervised Deep Learning by using Automatic Thresholding to Deal with Out of Distribution Data for COVID-19 Detection using Chest X-ray Images

Isaac Benavides-Mata; Saul Calderon-Ramirez

Semi-supervised learning (SSL) uses both labeled and unlabeled data to train models when labeled data is limited and unlabeled data is abundant. Because unlabeled data is often more widely available than labeled data, it is used to improve a model's generalization when labeled data is scarce. However, in real-world settings the unlabeled data may follow a different distribution than the labeled dataset. This is known as distribution mismatch. The problem generally occurs when the unlabeled data comes from a different source than the labeled data. For instance, in the medical imaging domain, when training a COVID-19 detector using chest X-ray images, unlabeled datasets sampled from different hospitals might be used. In this work, we propose an automatic thresholding method to filter out-of-distribution data from the unlabeled dataset. We score each unlabeled observation by the Mahalanobis distance between the labeled and unlabeled datasets, computed in the feature space built by a feature extractor (FE) pre-trained on ImageNet. We test two simple automatic thresholding methods in the context of training a COVID-19 detector using chest X-ray images. The tested methods provide an automatic way to decide which unlabeled data to keep when training a semi-supervised deep learning architecture.
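The scoring-and-filtering idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which two thresholding methods were tested, so Otsu's method is used here purely as an assumed stand-in, and the function names (`mahalanobis_scores`, `otsu_threshold`, `filter_unlabeled`) are hypothetical. The sketch also assumes that features have already been extracted by a pre-trained network and are passed in as NumPy arrays of shape (n_samples, n_features).

```python
import numpy as np


def mahalanobis_scores(labeled_feats, unlabeled_feats, eps=1e-6):
    """Mahalanobis distance of each unlabeled feature vector to the
    labeled feature distribution (mean and covariance of labeled_feats)."""
    mu = labeled_feats.mean(axis=0)
    cov = np.cov(labeled_feats, rowvar=False)
    # Small ridge term (an assumption) keeps the covariance invertible.
    cov = cov + eps * np.eye(cov.shape[0])
    inv_cov = np.linalg.inv(cov)
    diff = unlabeled_feats - mu
    sq = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(np.maximum(sq, 0.0))


def otsu_threshold(scores, bins=64):
    """Otsu's method on a 1-D score histogram: pick the bin edge that
    maximizes the between-class variance of the two resulting groups."""
    hist, edges = np.histogram(scores, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(hist)                 # counts at or below each bin
    w1 = w0[-1] - w0                     # counts above each bin
    s0 = np.cumsum(hist * centers)
    m0 = s0 / np.maximum(w0, 1)          # mean of the lower group
    m1 = (s0[-1] - s0) / np.maximum(w1, 1)  # mean of the upper group
    between = w0 * w1 * (m0 - m1) ** 2
    # Use the right edge of the selected bin as the threshold.
    return edges[1:][np.argmax(between)]


def filter_unlabeled(labeled_feats, unlabeled_feats):
    """Return a boolean mask of unlabeled samples to keep (low scores)."""
    scores = mahalanobis_scores(labeled_feats, unlabeled_feats)
    return scores <= otsu_threshold(scores)
```

Unlabeled samples whose score falls at or below the threshold are treated as in-distribution and kept; the rest are discarded before semi-supervised training. Other automatic rules (for example, a percentile or mean-plus-standard-deviation cutoff) could be substituted for `otsu_threshold` without changing the rest of the sketch.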

updated: Thu Nov 03 2022 20:56:45 GMT+0000 (UTC)

published: Thu Nov 03 2022 20:56:45 GMT+0000 (UTC)

arXiv
