When and Why Momentum Accelerates SGD: An Empirical Study

Jingwen Fu; Bohan Wang; Huishuai Zhang; Zhizheng Zhang; Wei Chen; Nanning Zheng

Momentum has become a crucial component of deep learning optimizers, which calls for a comprehensive understanding of when and why it accelerates stochastic gradient descent (SGD). To address the question of "when", we establish a meaningful comparison framework that examines the performance of SGD with Momentum (SGDM) under the effective learning rate η_ef, a notion that unifies the influence of the momentum coefficient μ and the batch size b on the learning rate η. Comparing SGDM and SGD at the same effective learning rate and the same batch size, we observe a consistent pattern: when η_ef is small, SGDM and SGD reach almost the same empirical training loss; once η_ef surpasses a certain threshold, SGDM begins to outperform SGD. Furthermore, the advantage of SGDM over SGD becomes more pronounced as the batch size grows. For the question of "why", we find that momentum acceleration is closely related to abrupt sharpening, a sudden jump of the directional Hessian along the update direction. Specifically, the performance gap between SGD and SGDM opens at the same moment that SGD experiences abrupt sharpening and begins to converge more slowly. Momentum improves the performance of SGDM by preventing or deferring the occurrence of abrupt sharpening. Together, these findings reveal the interplay between momentum, learning rate, and batch size, and improve our understanding of momentum acceleration.
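
The two quantities named in the abstract can be illustrated with a small sketch. The abstract does not state the formula for η_ef, so the code below assumes only the common heavy-ball rescaling η / (1 − μ) and leaves the batch-size dependence out; the paper's actual definition may differ. The directional Hessian is taken here as the Rayleigh quotient dᵀHd / dᵀd along a direction d, which is likewise an assumption about the intended definition. All function names, the quadratic objective, and the hyperparameter values are hypothetical and chosen for illustration.

```python
import numpy as np


def effective_lr(lr, momentum):
    """Assumed momentum-only rescaling of the step size: lr / (1 - momentum).

    The paper's eta_ef also accounts for batch size; that part is omitted here.
    """
    return lr / (1.0 - momentum)


def directional_hessian(hessian, direction):
    """Curvature along a direction d, computed as d^T H d / d^T d (assumed definition)."""
    d = np.asarray(direction, dtype=float)
    return float(d @ hessian @ d / (d @ d))


def run(hessian, w0, lr, momentum, steps):
    """Heavy-ball updates on the noise-free quadratic f(w) = 0.5 * w^T H w.

    Update rule: v <- momentum * v + grad, then w <- w - lr * v.
    Setting momentum = 0 recovers plain gradient descent.
    """
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)
    losses = []
    for _ in range(steps):
        grad = hessian @ w
        v = momentum * v + grad
        w = w - lr * v
        losses.append(0.5 * float(w @ hessian @ w))
    return w, losses


# Toy problem: curvature 1 along the first axis, 10 along the second.
H = np.diag([1.0, 10.0])
w0 = np.array([1.0, 1.0])

mu = 0.9
eta_ef = 0.01  # shared effective learning rate (illustrative value)

# Plain descent uses eta_ef directly; the momentum run shrinks the raw
# learning rate by (1 - mu) so both runs share the same assumed eta_ef.
_, sgd_losses = run(H, w0, lr=eta_ef, momentum=0.0, steps=200)
_, sgdm_losses = run(H, w0, lr=eta_ef * (1.0 - mu), momentum=mu, steps=200)
```

Matching the runs on η_ef rather than on the raw learning rate mirrors the comparison framework the abstract describes; the toy problem is deterministic and full-batch, so it does not reproduce the paper's stochastic experiments or its abrupt-sharpening observations.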

updated: Thu Jun 15 2023 09:54:21 GMT+0000 (UTC)

published: Thu Jun 15 2023 09:54:21 GMT+0000 (UTC)

arXiv
