Learning from Training Dynamics: Identifying Mislabeled Data Beyond Manually Designed Features

Qingrui Jia; Xuhong Li; Lei Yu; Jiang Bian; Penghao Zhao; Shupeng Li; Haoyi Xiong; Dejing Dou


While mislabeled or ambiguously labeled samples in the training set can negatively affect the performance of deep models, diagnosing the dataset and identifying mislabeled samples helps to improve generalization. Training dynamics, i.e., the traces left by iterations of optimization algorithms, have recently been shown to be effective for locating mislabeled samples using hand-crafted features. In this paper, going beyond manually designed features, we introduce a novel learning-based solution: a noise detector, instantiated as an LSTM network, that learns to predict whether a sample is mislabeled using the raw training dynamics as input. Specifically, the proposed method trains the noise detector in a supervised manner on a dataset with synthesized label noise, and the detector can then be applied to various datasets (with either natural or synthesized label noise) without retraining. We conduct extensive experiments to evaluate the proposed method. We train the noise detector on the CIFAR dataset with synthesized label noise and test it on Tiny ImageNet, CUB-200, Caltech-256, WebVision and Clothing1M. Results show that the proposed method precisely detects mislabeled samples on various datasets without further adaptation and outperforms state-of-the-art methods. Further experiments demonstrate that mislabel identification can guide label correction, namely data debugging, providing improvements from the data side that are orthogonal to algorithm-centric state-of-the-art techniques.
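
As a rough illustration of the supervised setup described in the abstract, the following minimal Python sketch shows how training data for such a noise detector might be assembled: labels are corrupted synthetically so that the mislabeled/clean flag of every sample is known, and each sample's training dynamics are collected as a sequence over epochs. The abstract does not specify the noise model or the exact form of the dynamics, so the symmetric label flipping, the choice of "probability assigned to the given label at each epoch" as the raw signal, and the function names `inject_symmetric_noise` and `dynamics_for_sample` are all assumptions made for illustration. The LSTM detector itself is omitted.

```python
import random


def inject_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip a fraction of labels to a different class (assumed noise model).

    Returns the noisy labels and a 0/1 flag per sample (1 = mislabeled),
    which would serve as the supervised target for a noise detector.
    Requires num_classes >= 2.
    """
    rng = random.Random(seed)
    n = len(labels)
    flipped = set(rng.sample(range(n), int(round(noise_rate * n))))
    noisy, is_noisy = [], []
    for i, y in enumerate(labels):
        if i in flipped:
            # Pick any class except the original one, so the label really changes.
            noisy.append(rng.choice([c for c in range(num_classes) if c != y]))
            is_noisy.append(1)
        else:
            noisy.append(y)
            is_noisy.append(0)
    return noisy, is_noisy


def dynamics_for_sample(per_epoch_probs, given_label):
    """Assumed form of raw training dynamics for one sample: the probability
    the classifier assigns to the sample's (possibly noisy) label at each epoch."""
    return [probs[given_label] for probs in per_epoch_probs]


if __name__ == "__main__":
    clean = [i % 10 for i in range(100)]
    noisy, flags = inject_symmetric_noise(clean, num_classes=10, noise_rate=0.2)
    print(sum(flags), "of", len(flags), "labels were flipped")

    # Toy softmax outputs for one sample over three epochs (two classes).
    probs_over_epochs = [[0.6, 0.4], [0.7, 0.3], [0.9, 0.1]]
    print(dynamics_for_sample(probs_over_epochs, given_label=1))
```

Each (sequence, flag) pair produced this way could then be used to train a sequence classifier such as an LSTM; because the detector would see only the dynamics and not the images, it is plausible that it transfers to other datasets without retraining, as the abstract reports.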

updated: Tue Dec 20 2022 06:37:00 GMT+0000 (UTC)

published: Mon Dec 19 2022 09:39:30 GMT+0000 (UTC)

arXiv
