diffGrad: An Optimization Method for Convolutional Neural Networks

Shiv Ram Dubey; Soumendu Chakraborty; Swalpa Kumar Roy; Snehasis Mukherjee; Satish Kumar Singh; Bidyut Baran Chaudhuri

Stochastic Gradient Descent (SGD) is one of the core techniques behind the success of deep neural networks. The gradient indicates the direction in which a function changes most steeply. The main problem with basic SGD is that it updates all parameters with equal-sized steps, irrespective of gradient behavior. Hence, an efficient way to optimize deep networks is to use an adaptive step size for each parameter. Recently, several methods such as AdaGrad, AdaDelta, RMSProp and Adam have been proposed to improve gradient descent. These methods rely on the square root of the exponential moving average of squared past gradients and therefore do not take advantage of the local change in gradients. In this paper, a novel optimizer, diffGrad, is proposed based on the difference between the present and the immediate past gradient. In the proposed diffGrad optimization technique, the step size is adjusted for each parameter so that parameters whose gradients change quickly take larger steps and parameters whose gradients change slowly take smaller steps. The convergence analysis is done using the regret bound approach of the online learning framework. A rigorous analysis is made in this paper over three synthetic complex non-convex functions. Image categorization experiments are also conducted on the CIFAR10 and CIFAR100 datasets to compare the performance of diffGrad with state-of-the-art optimizers such as SGDM, AdaGrad, AdaDelta, RMSProp, AMSGrad, and Adam. A residual unit (ResNet) based Convolutional Neural Network (CNN) architecture is used in the experiments. The experiments show that diffGrad outperforms the other optimizers. We also show that diffGrad performs uniformly well when training CNNs with different activation functions. The source code is publicly available at https://github.com/shivram1987/diffGrad.
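
The abstract describes the update only in words, so the following is a minimal sketch of one way the idea could look rather than the authors' implementation (see the repository linked above for that). It assumes Adam-style moving averages of the gradient and squared gradient, and it assumes the gradient difference enters as a sigmoid of the absolute difference between the previous and current gradient. The function names `init_state` and `diffgrad_step`, the state layout, and the default hyperparameters are all illustrative choices, not taken from the abstract.

```python
import math


def init_state(n):
    """Create optimizer state for n scalar parameters (hypothetical layout)."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n, "prev_grad": [0.0] * n}


def diffgrad_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one diffGrad-style update to a list of scalar parameters."""
    state["t"] += 1
    t = state["t"]
    new_theta = []
    for i, (p, g) in enumerate(zip(theta, grad)):
        # Adam-style moving averages of the gradient and the squared gradient.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction for the zero-initialised averages.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        # Assumed friction term: sigmoid of |previous grad - current grad|.
        # It lies between 0.5 and 1, so a slowly changing gradient shrinks
        # the step and a quickly changing gradient leaves it close to Adam's.
        xi = 1.0 / (1.0 + math.exp(-abs(state["prev_grad"][i] - g)))
        new_theta.append(p - lr * xi * m_hat / (math.sqrt(v_hat) + eps))
        state["prev_grad"][i] = g
    return new_theta


# Usage: minimise f(x) = x^2, whose gradient is 2x.
theta = [1.0]
state = init_state(1)
for _ in range(500):
    theta = diffgrad_step(theta, [2.0 * theta[0]], state, lr=0.01)
```

Under these assumptions the only difference from an Adam step is the per-parameter factor `xi`, which is where the "difference of gradients" named in the title enters the update.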

updated: Fri Mar 06 2020 06:51:39 GMT+0000 (UTC)

published: Thu Sep 12 2019 06:20:05 GMT+0000 (UTC)

arXiv
