On the Effectiveness of Out-of-Distribution Data in Self-Supervised Long-Tail Learning

Jianhong Bai; Zuozhu Liu; Hualiang Wang; Jin Hao; Yang Feng; Huanpeng Chu; Haoji Hu

Although self-supervised learning (SSL) has been widely studied as a promising technique for representation learning, it does not generalize well on long-tailed datasets because the majority classes dominate the feature space. Recent work shows that long-tailed learning performance can be boosted by sampling extra in-domain (ID) data for self-supervised training; however, large-scale ID data that can rebalance the minority classes are expensive to collect. In this paper, we propose an alternative that is easy to use and effective: Contrastive with Out-of-distribution (OOD) data for Long-Tail learning (COLT), which exploits OOD data to dynamically re-balance the feature space. We empirically identify the counter-intuitive usefulness of OOD samples in SSL long-tailed learning and design a principled new SSL method around it. Concretely, we first localize the 'head' and 'tail' samples by assigning a tailness score to each OOD sample based on its neighborhood in the feature space. Then, we propose an online OOD sampling strategy to dynamically re-balance the feature space. Finally, we train the model to distinguish ID from OOD samples with a distribution-level supervised contrastive loss. Extensive experiments on various datasets and several state-of-the-art SSL frameworks verify the effectiveness of the proposed method. The results show that our method improves the performance of SSL on long-tailed datasets by a large margin and even outperforms previous work that uses external ID data. Our code is available at https://github.com/JianhongBai/COLT.

updated: Thu Jun 08 2023 04:32:10 GMT+0000 (UTC)

published: Thu Jun 08 2023 04:32:10 GMT+0000 (UTC)

arXiv
