AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

Juntang Zhuang; Tommy Tang; Yifan Ding; Sekhar Tatikonda; Nicha Dvornek; Xenophon Papademetris; James S. Duncan


Most popular optimizers for deep learning can be broadly categorized as adaptive methods (e.g., Adam) and accelerated schemes (e.g., stochastic gradient descent (SGD) with momentum). For many models such as convolutional neural networks (CNNs), adaptive methods typically converge faster but generalize worse than SGD; for complex settings such as generative adversarial networks (GANs), adaptive methods are typically the default because of their stability. We propose AdaBelief to simultaneously achieve three goals: fast convergence as in adaptive methods, good generalization as in SGD, and training stability. The intuition for AdaBelief is to adapt the stepsize according to the "belief" in the current gradient direction. Viewing the exponential moving average (EMA) of the noisy gradient as the prediction of the gradient at the next time step, if the observed gradient greatly deviates from the prediction, we distrust the current observation and take a small step; if the observed gradient is close to the prediction, we trust it and take a large step. We validate AdaBelief in extensive experiments, showing that it outperforms other methods with fast convergence and high accuracy on image classification and language modeling. Specifically, on ImageNet, AdaBelief achieves comparable accuracy to SGD. Furthermore, in the training of a GAN on CIFAR-10, AdaBelief demonstrates high stability and improves the quality of generated samples compared to a well-tuned Adam optimizer. Code is available at https://github.com/juntang-zhuang/Adabelief-Optimizer
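To make the "belief" intuition concrete, the following is a minimal sketch of an AdaBelief-style update for a single scalar parameter. It is an illustration, not the code from the linked repository: the function names `adabelief_step` and `run` are hypothetical, the hyperparameter defaults are assumed Adam-style values, and extras such as weight decay are omitted. The only change relative to Adam is that the second-moment EMA tracks the squared deviation of the gradient from its EMA, rather than the squared gradient itself.

```python
import math


def adabelief_step(theta, grad, m, s, t,
                   lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBelief-style update for a scalar parameter (illustrative sketch).

    m: EMA of gradients, read as the prediction of the next gradient.
    s: EMA of the squared deviation (grad - m)^2; small s means high belief.
    t: 1-based step count, used for bias correction.
    Default hyperparameters are assumptions, not values taken from the abstract.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    # Adam would accumulate grad**2 here; AdaBelief accumulates (grad - m)**2.
    s = beta2 * s + (1.0 - beta2) * (grad - m) ** 2 + eps
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected prediction
    s_hat = s / (1.0 - beta2 ** t)  # bias-corrected deviation
    theta = theta - lr * m_hat / (math.sqrt(s_hat) + eps)
    return theta, m, s


def run(grads, lr=1e-3):
    """Apply the update to a sequence of observed gradients, starting from 0."""
    theta, m, s = 0.0, 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        theta, m, s = adabelief_step(theta, g, m, s, t, lr=lr)
    return theta


if __name__ == "__main__":
    consistent = [1.0] * 100        # observations agree with the prediction
    noisy = [3.0, -1.0] * 50        # same mean gradient, large deviations
    print("consistent gradients:", run(consistent))
    print("noisy gradients:     ", run(noisy))
```

Both gradient sequences have the same mean, but the consistent one stays close to its EMA, so the deviation term shrinks and the parameter travels further; the noisy sequence keeps the deviation large and the steps small.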

updated: Sat Nov 28 2020 03:01:41 GMT+0000 (UTC)

published: Thu Oct 15 2020 01:46:13 GMT+0000 (UTC)

arXiv

