Beyond BatchNorm: Towards a General Understanding of Normalization in Deep Learning

Ekdeep Singh Lubana; Robert P. Dick; Hidenori Tanaka

Inspired by BatchNorm, there has been an explosion of normalization layers for deep neural networks (DNNs). However, these alternative normalization layers have seen minimal use, partly due to a lack of guiding principles that can help identify when they can serve as a replacement for BatchNorm. To address this problem, we take a theoretical approach, generalizing the known beneficial mechanisms of BatchNorm to several recently proposed normalization techniques. Our generalized theory leads to the following set of principles: (i) similar to BatchNorm, activation-based normalization layers can prevent exponential growth of activations in ResNets, but parametric layers require explicit remedies; (ii) use of GroupNorm can ensure informative forward propagation, with different samples being assigned dissimilar activations, but increasing group size results in increasingly indistinguishable activations for different samples, explaining slow convergence speed in models with LayerNorm; (iii) small group sizes result in large gradient norms in earlier layers, hence explaining training instability issues in Instance Normalization and illustrating a speed-stability tradeoff in GroupNorm. Overall, our analysis reveals a unified set of mechanisms that underpin the success of normalization methods in deep learning, providing us with a compass to systematically explore the vast design space of DNN normalization layers.
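To make the role of group size in principles (ii) and (iii) concrete, the following is a minimal sketch of group normalization on a single sample, written in plain Python. It is not the authors' code: the function name `group_norm`, the `eps` default, and the toy `sample` values are assumptions chosen for illustration, and the learnable scale and shift that practical layers usually include are omitted. With one channel per group the computation matches Instance Normalization, and with a single group covering all channels it matches LayerNorm applied over the whole sample.

```python
import math


def group_norm(x, num_groups, eps=1e-5):
    """Normalize one sample with per-group statistics.

    x is a list of channels, each a list of floats (flattened spatial positions).
    Channels are split into `num_groups` contiguous groups, and every group is
    normalized with its own mean and variance. No learnable scale or shift.
    """
    num_channels = len(x)
    if num_channels % num_groups != 0:
        raise ValueError("num_channels must be divisible by num_groups")
    group_size = num_channels // num_groups

    out = []
    for g in range(num_groups):
        group = x[g * group_size:(g + 1) * group_size]
        # Statistics are pooled over every channel and position in the group.
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in channel] for channel in group])
    return out


# Hypothetical toy sample: 4 channels, 3 positions each.
sample = [
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
    [-1.0, 0.0, 1.0],
    [5.0, 5.5, 6.0],
]

instance_like = group_norm(sample, num_groups=4)  # group size 1 (Instance Normalization)
layer_like = group_norm(sample, num_groups=1)     # group size 4 (LayerNorm over the sample)
```

In the group-size-1 case every channel is centered on its own, while in the single-group case only the sample as a whole is centered, so individual channels keep their relative offsets. The sketch shows only how the statistics are pooled; it does not reproduce the paper's analysis of forward propagation or gradient norms.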

updated: Thu Jul 08 2021 17:57:01 GMT+0000 (UTC)

published: Thu Jun 10 2021 17:51:30 GMT+0000 (UTC)

arXiv
