The Value of Out-of-Distribution Data

Ashwin De Silva; Rahul Ramesh; Carey E. Priebe; Pratik Chaudhari; Joshua T. Vogelstein

More data is expected to help us generalize to a task. But real datasets can contain out-of-distribution (OOD) data; this can come in the form of heterogeneity such as intra-class variability, but also in the form of temporal shifts or concept drift. We demonstrate a counter-intuitive phenomenon for such problems: the generalization error of the task can be a non-monotonic function of the number of OOD samples. A small number of OOD samples can improve generalization, but if the number of OOD samples exceeds a threshold, the generalization error can deteriorate. We also show that if we know which samples are OOD, then using a weighted objective between the target and OOD samples ensures that the generalization error decreases monotonically. We demonstrate and analyze this phenomenon using linear classifiers on synthetic datasets and medium-sized neural networks on vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS, and DomainNet, and observe the effect that data augmentation, hyperparameter optimization, and pre-training have on this behavior.
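
The weighted objective mentioned in the abstract can be pictured as a convex combination of the average loss on target samples and the average loss on OOD samples. The sketch below is only an illustration under that assumption: the abstract does not give the exact form, and the function name `weighted_objective` and the mixing weight `alpha` are hypothetical names introduced here.

```python
import numpy as np


def weighted_objective(target_losses, ood_losses, alpha):
    """Mix the mean target loss and the mean OOD loss.

    Assumed form (not taken from the abstract):
        alpha * mean(target_losses) + (1 - alpha) * mean(ood_losses)
    alpha = 1 ignores the OOD samples; alpha = 0 ignores the target samples.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    target_losses = np.asarray(target_losses, dtype=float)
    ood_losses = np.asarray(ood_losses, dtype=float)
    # An empty group contributes zero so the other group can still be scored.
    target_term = target_losses.mean() if target_losses.size else 0.0
    ood_term = ood_losses.mean() if ood_losses.size else 0.0
    return float(alpha * target_term + (1.0 - alpha) * ood_term)


# Per-sample losses here are made-up numbers, purely for illustration.
example_value = weighted_objective([1.0, 3.0], [10.0], alpha=0.75)
```

Because each group is averaged before mixing, the weight given to the OOD group does not grow with the number of OOD samples, which would not be the case if all samples were simply pooled into one average.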

updated: Thu Oct 06 2022 10:12:00 GMT+0000 (UTC)

published: Tue Aug 23 2022 13:41:01 GMT+0000 (UTC)

arXiv
