A New Perspective for Understanding Generalization Gap of Deep Neural Networks Trained with Large Batch Sizes

Oyebade K. Oyedotun; Konstantinos Papadopoulos; Djamila Aouada

Deep neural networks (DNNs) are typically optimized using various forms of mini-batch gradient descent. A major motivation for mini-batch gradient descent is that, with a suitably chosen batch size, available computing resources can be optimally utilized (including parallelization) for fast model training. However, many works report a progressive loss of model generalization when the training batch size is increased beyond some limit, a scenario commonly referred to as the generalization gap. Although several works have proposed methods for alleviating the generalization gap, a consensus explanation of it is still lacking in the literature. This is especially important given that recent works have observed that several proposed solutions to the generalization gap problem, such as learning rate scaling and an increased training budget, do not in fact resolve it. The main aim of this paper is therefore to investigate the source of generalization loss in DNNs trained with a large batch size and to provide a new perspective on it. Our analysis suggests that a large training batch size results in increased near-rank loss of units' activation (i.e., output) tensors, which in turn impacts model optimization and generalization. Extensive validation experiments are performed on popular DNN models such as VGG-16, a residual network (ResNet-56), and LeNet-5, using the CIFAR-10, CIFAR-100, Fashion-MNIST, and MNIST datasets.
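
The abstract does not define how near-rank loss is measured, so the following is only an illustrative sketch of one possible proxy, not the authors' method. It assumes the activations of a layer are collected into a 2-D matrix of shape (batch_size, num_units) and counts singular values that fall below a relative threshold. The function name `near_rank_deficiency` and the parameter `rel_tol` are hypothetical choices made for this example.

```python
import numpy as np


def near_rank_deficiency(activations, rel_tol=1e-3):
    """Count singular values smaller than rel_tol times the largest one.

    activations: 2-D array-like of shape (batch_size, num_units).
    rel_tol: hypothetical relative threshold (an assumption, not from the paper).
    """
    a = np.asarray(activations, dtype=np.float64)
    # Singular values are returned in descending order.
    s = np.linalg.svd(a, compute_uv=False)
    if s.size == 0 or s[0] == 0.0:
        # An all-zero (or empty) matrix is treated as deficient in every direction.
        return int(s.size)
    return int(np.sum(s < rel_tol * s[0]))


# Synthetic stand-ins for activation matrices (no real network is involved).
rng = np.random.default_rng(0)
well_conditioned = rng.standard_normal((64, 8))

# Make the last unit almost a copy of the first: exact rank is still 8,
# but one direction carries almost no independent information.
nearly_dependent = well_conditioned.copy()
nearly_dependent[:, -1] = nearly_dependent[:, 0] + 1e-6 * rng.standard_normal(64)

print(near_rank_deficiency(well_conditioned))
print(near_rank_deficiency(nearly_dependent))
```

Under this proxy, a matrix can have full exact rank while still being close to rank-deficient, which is one plausible reading of "near-rank loss"; the paper itself should be consulted for the definition actually used.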

updated: Fri Oct 21 2022 18:23:12 GMT+0000 (UTC)

published: Fri Oct 21 2022 18:23:12 GMT+0000 (UTC)

arXiv
