Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers

Gabriele Prato; Simon Guiroy; Ethan Caballero; Irina Rish; Sarath Chandar

The empirical science of neural scaling laws is a rapidly growing area of significant importance to the future of machine learning, particularly in light of recent breakthroughs achieved by large-scale pre-trained models such as GPT-3, CLIP and DALL-E. Accurately predicting how neural network performance changes with increasing resources such as data, compute and model size provides a more comprehensive evaluation of different approaches across multiple scales than traditional point-wise comparisons of fixed-size models on fixed-size benchmarks. Most importantly, it allows the field to focus on the best-scaling, and thus most promising, approaches. In this work, we consider the challenging problem of few-shot learning in image classification, especially when the target data distribution in the few-shot phase differs from the source (training) data distribution in the sense that it includes new image classes not encountered during training. Our main goal is to investigate how the amount of pre-training data affects the few-shot generalization performance of standard image classifiers. Our key observations are that (1) such performance improvements are well approximated by power laws (linear on log-log plots) as the training set size increases, (2) this holds whether the target data comes from the same domain as the training data or from a different one (i.e., new classes), and (3) few-shot performance on new classes converges at a faster rate than standard classification performance on previously seen classes. Our findings shed new light on the relationship between scale and generalization.
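To make the "linear on log-log plots" observation concrete, the following is a minimal sketch of how a power law could be fitted to error measurements taken at several training set sizes. It assumes the simple form error ≈ a · n^(-b) with no irreducible-error offset, which the abstract does not specify. The function names `fit_power_law` and `predict_error` are hypothetical, and the data points are synthetic values chosen for illustration, not results from the paper.

```python
import numpy as np


def fit_power_law(train_sizes, errors):
    """Fit errors ~ a * train_sizes**(-b) by least squares in log-log space.

    Assumed functional form (not taken from the paper): a pure power law
    without an additive offset. Returns the pair (a, b).
    """
    log_n = np.log(np.asarray(train_sizes, dtype=float))
    log_e = np.log(np.asarray(errors, dtype=float))
    # A power law is a straight line in log-log coordinates:
    # log(error) = log(a) - b * log(n)
    slope, intercept = np.polyfit(log_n, log_e, deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_error(train_size, a, b):
    """Extrapolate the fitted power law to a new training set size."""
    return a * float(train_size) ** (-b)


# Synthetic, noise-free points for illustration only (a = 2.0, b = 0.3).
sizes = [1_000, 2_000, 4_000, 8_000, 16_000]
errs = [2.0 * n ** -0.3 for n in sizes]

a, b = fit_power_law(sizes, errs)
extrapolated = predict_error(32_000, a, b)
```

Under this parameterization, a larger exponent `b` means the error falls faster as the training set grows, so fitting the few-shot and standard-classification curves separately and comparing their exponents is one way such convergence rates could be compared.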

updated: Wed Oct 13 2021 19:07:01 GMT+0000 (UTC)

published: Wed Oct 13 2021 19:07:01 GMT+0000 (UTC)

arXiv
