Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks

Soham De; Samuel L. Smith

Batch normalization dramatically increases the largest trainable depth of residual networks, and this benefit has been crucial to the empirical success of deep residual networks on a wide range of benchmarks. We show that this key benefit arises because, at initialization, batch normalization downscales the residual branch relative to the skip connection, by a normalizing factor on the order of the square root of the network depth. This ensures that, early in training, the function computed by normalized residual blocks in deep networks is close to the identity function (on average). We use this insight to develop a simple initialization scheme that can train deep residual networks without normalization. We also provide a detailed empirical study of residual networks, which clarifies that, although batch normalized networks can be trained with larger learning rates, this effect is only beneficial in specific compute regimes, and has minimal benefits when the batch size is small.
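The scaling argument in the abstract can be checked numerically with a toy model. The sketch below is not the authors' code: it assumes a fully connected residual network with He-initialized weights, a ReLU nonlinearity, and batch normalization at the start of each residual branch. The names `batch_norm`, `run_resnet`, and `alpha` are hypothetical. Under these assumptions each normalized branch contributes roughly unit variance, so the variance on the skip path grows about linearly with depth and the branch is smaller than the skip path by a factor of roughly the square root of the block index. The scalar `alpha` on the residual branch is one assumed way to mimic this effect without normalization: setting it to zero at initialization makes every block exactly the identity.

```python
import numpy as np


def batch_norm(x, eps=1e-5):
    """Standardize each feature over the batch axis (no learned scale or shift)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def run_resnet(depth, width=256, batch=512, normalize=True, alpha=1.0, seed=0):
    """Forward pass through a toy residual network at initialization.

    Each block computes x + alpha * relu(h) @ w, where h is batch_norm(x)
    when normalize=True and x otherwise. Returns the final activations and,
    for every block, the ratio std(branch) / std(skip).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))  # unit-variance inputs
    ratios = []
    for _ in range(depth):
        h = batch_norm(x) if normalize else x
        # He initialization: keeps the second moment through a ReLU layer.
        w = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        branch = alpha * (np.maximum(h, 0.0) @ w)
        ratios.append(branch.std() / x.std())
        x = x + branch
    return x, ratios


if __name__ == "__main__":
    depth = 64
    x_bn, ratios_bn = run_resnet(depth, normalize=True)
    print("normalized: final variance", x_bn.var(), "vs depth + 1 =", depth + 1)
    print("normalized: last branch/skip ratio", ratios_bn[-1],
          "vs 1/sqrt(depth) =", 1 / np.sqrt(depth))

    x_raw, _ = run_resnet(depth, normalize=False)
    print("unnormalized, alpha=1: final variance", x_raw.var())

    x_id, _ = run_resnet(depth, normalize=False, alpha=0.0)
    x_in, _ = run_resnet(0)
    print("unnormalized, alpha=0: identity at init?", np.array_equal(x_id, x_in))
```

In this toy setting the unnormalized network with `alpha=1.0` roughly doubles its activation variance in every block, which is the kind of growth that makes very deep unnormalized networks hard to train, whereas both the normalized network and the `alpha=0.0` variant keep each block close to the identity at initialization.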

updated: Wed Dec 09 2020 10:18:10 GMT+0000 (UTC)

published: Mon Feb 24 2020 18:43:03 GMT+0000 (UTC)

arXiv
