Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism

Shulun Wang; Bin Liu; Feng Liu

Softmax is widely used in neural networks for multiclass classification, gating structures, and attention mechanisms. The statistical assumption that the input is normally distributed supports the gradient stability of Softmax. However, when Softmax is used in attention mechanisms such as those in transformers, the correlation scores between embeddings are often not normally distributed, so the gradient vanishing problem appears; we confirm this experimentally. In this work, we propose replacing the exponential function with periodic functions, and we examine several potential periodic alternatives to Softmax in terms of both value and gradient. In experiments on a simple demo model designed with reference to LeViT, our method is shown to alleviate the gradient problem and to yield substantial improvements over Softmax and its variants. Further, we analyze the impact of pre-normalization on Softmax and on our methods, both mathematically and experimentally. Lastly, we increase the depth of the demo model and show that our method remains applicable in deep structures.
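To make the gradient argument concrete, the following is a minimal NumPy sketch, not the authors' implementation. It compares ordinary Softmax with one possible periodic replacement in which `exp(x)` becomes `exp(sin(x))`. The function name `sin_softmax`, the example scores, and the finite-difference helper `grad_norm` are assumptions made for illustration; the abstract does not specify which periodic functions the paper evaluates.

```python
import numpy as np


def softmax(x):
    """Standard Softmax, shifted by max(x) for numerical stability."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def sin_softmax(x):
    """Illustrative periodic variant (an assumption, not taken from the
    abstract): the exponent is passed through sin, so it stays in [-1, 1]."""
    e = np.exp(np.sin(x))
    return e / e.sum()


def grad_norm(f, x, eps=1e-6):
    """Frobenius norm of the Jacobian of f at x, by central differences."""
    n = x.size
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        jac[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return np.linalg.norm(jac)


# Hypothetical attention scores; larger scales mimic large-magnitude inputs.
scores = np.array([1.0, 2.0, 3.0, 4.0])
for scale in (1.0, 10.0, 100.0):
    x = scale * scores
    print(scale, grad_norm(softmax, x), grad_norm(sin_softmax, x))
```

As the scale grows, Softmax saturates toward a one-hot output and its Jacobian norm collapses toward zero, while the periodic variant keeps a bounded exponent and a gradient that does not vanish merely because the scores are large.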

updated: Mon Aug 16 2021 15:26:31 GMT+0000 (UTC)

published: Mon Aug 16 2021 15:26:31 GMT+0000 (UTC)

arXiv
