SERF: Towards better training of deep neural networks using log-Softplus ERror activation Function

Sayan Nag; Mayukh Bhattacharyya

Activation functions play a pivotal role in determining the training dynamics and performance of neural networks. The widely adopted activation function ReLU, despite being simple and effective, has a few disadvantages, including the Dying ReLU problem. To tackle such problems, we propose a novel activation function called Serf, which is self-regularized and non-monotonic in nature. Like Mish, Serf belongs to the Swish family of functions. Based on several experiments on computer vision (image classification and object detection) and natural language processing (machine translation, sentiment classification and multimodal entailment) tasks with different state-of-the-art architectures, we observe that Serf vastly outperforms ReLU (the baseline) and other activation functions, including both Swish and Mish, with a markedly bigger margin on deeper architectures. Ablation studies further demonstrate that Serf-based architectures perform better than Swish- and Mish-based ones in varying scenarios, validating the effectiveness and compatibility of Serf with varying depths, complexities, optimizers, learning rates, batch sizes, initializers and dropout rates. Finally, we investigate the mathematical relation between Swish and Serf, showing the impact of the preconditioner function ingrained in the first derivative of Serf, which provides a regularization effect that makes gradients smoother and optimization faster.
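The abstract does not spell out the formula. As the title's "log-Softplus ERror" suggests, the paper defines Serf as f(x) = x · erf(ln(1 + e^x)), that is, the input multiplied by the error function of its softplus. The following is a minimal scalar sketch under that definition, with Swish and Mish alongside for comparison. The function names and the numerically stable softplus helper are illustrative choices here, not the authors' code.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def serf(x: float) -> float:
    """Serf: x * erf(softplus(x)), per the paper's definition."""
    return x * math.erf(softplus(x))


def mish(x: float) -> float:
    """Mish: x * tanh(softplus(x)), shown for comparison."""
    return x * math.tanh(softplus(x))


def swish(x: float) -> float:
    """Swish with beta = 1: x * sigmoid(x), written as x/2 * (1 + tanh(x/2))."""
    return 0.5 * x * (1.0 + math.tanh(0.5 * x))


if __name__ == "__main__":
    # Print a few sample points to compare the three curves.
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"x={x:+.1f}  serf={serf(x):+.4f}  mish={mish(x):+.4f}  swish={swish(x):+.4f}")
```

Evaluating the sketch at a few negative inputs illustrates the non-monotonic shape mentioned in the abstract: the output dips below zero for moderately negative x and returns toward zero as x becomes more negative, while for large positive x it approaches the identity.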

updated: Tue Aug 24 2021 05:39:22 GMT+0000 (UTC)

published: Sat Aug 21 2021 23:33:57 GMT+0000 (UTC)

arXiv
