arXiv reaDer
Efficient Lifelong Model Evaluation in an Era of Rapid Progress
Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. These benchmarks introduce a major challenge: the high cost of evaluating a growing number of models across very large sample sets. To address this challenge, we introduce an efficient framework for model evaluation, Sort & Search (S&S)}, which reuses previously evaluated models by leveraging dynamic programming algorithms to selectively rank and sub-select test samples. To test our approach at scale, we create Lifelong-CIFAR10 and Lifelong-ImageNet, containing 1.69M and 1.98M test samples for classification. Extensive empirical evaluations across over 31,000 models demonstrate that S&S achieves highly-efficient approximate accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours (about 1000x reduction) on a single A100 GPU, with low approximation error and memory cost of <100MB. Our work also highlights issues with current accuracy prediction metrics, suggesting a need to move towards sample-level evaluation metrics. We hope to guide future research by showing our method's bottleneck lies primarily in generalizing Sort beyond a single rank order and not in improving Search.
updated: Sat Nov 23 2024 22:30:55 GMT+0000 (UTC)
published: Thu Feb 29 2024 18:58:26 GMT+0000 (UTC)
参考文献 (このサイトで利用可能なもの) / References (only if available on this site)
被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)
Amazon.co.jpアソシエイト