Stabilizing Transformer Training by Preventing Attention Entropy Collapse

Shuangfei Zhai; Tatiana Likhomanenko; Etai Littwin; Dan Busbridge; Jason Ramapuram; Yizhe Zhang; Jiatao Gu; Josh Susskind

Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy of each attention head over the course of training, which serves as a proxy for model sharpness. We identify a common pattern across different architectures and tasks: low attention entropy is accompanied by high training instability, which can take the form of an oscillating loss or divergence. We refer to pathologically low attention entropy, corresponding to highly concentrated attention scores, as entropy collapse. As a remedy, we propose σReparam, a simple and efficient solution in which we reparameterize all linear layers with spectral normalization and an additional learned scalar. We demonstrate that the proposed reparameterization successfully prevents entropy collapse in the attention layers, promoting more stable training. Additionally, we prove a tight lower bound on the attention entropy, which decreases exponentially fast with the spectral norm of the attention logits, providing additional motivation for our approach. We conduct experiments with σReparam on image classification, image self-supervised learning, machine translation, automatic speech recognition, and language modeling tasks, across Transformer architectures. We show that σReparam provides stability and robustness with respect to the choice of hyperparameters, going so far as to enable training of (a) a Vision Transformer to competitive performance without warmup, weight decay, layer normalization, or adaptive optimizers; (b) deep architectures in machine translation; and (c) speech recognition models to competitive performance without warmup or adaptive optimizers.
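The two quantities named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`attention_entropy`, `spectral_norm`, `sigma_reparam`) are hypothetical. The reparameterized weight is assumed to take the form (γ / σ(W)) · W, with σ(W) the spectral norm and γ the learned scalar, which is one reading of "spectral normalization and an additional learned scalar". Power iteration is likewise an assumed way of estimating σ(W).

```python
import numpy as np

def attention_entropy(logits):
    """Mean Shannon entropy (in nats) of row-wise softmax attention.

    logits: array of shape (T, T) holding one head's attention logits.
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize the softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # Clip before the log so that p = 0 contributes 0 rather than NaN.
    row_entropy = -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=-1)
    return float(row_entropy.mean())

def spectral_norm(W, n_iter=50, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)

def sigma_reparam(W, gamma=1.0):
    """Assumed form of the reparameterization: (gamma / sigma(W)) * W.

    gamma stands in for the learned scalar; here it is a plain float.
    """
    return (gamma / spectral_norm(W)) * W

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    logits = rng.standard_normal((8, 8))
    # Scaling the logits up concentrates attention and lowers the entropy.
    for scale in (0.1, 1.0, 10.0, 100.0):
        print(scale, attention_entropy(scale * logits))
```

Running the script prints the entropy falling from near log(8) toward zero as the logit scale grows, the behavior the abstract calls entropy collapse. Under the assumed form, the rescaled weight has spectral norm equal to `gamma`, so the logit magnitude is governed by a single scalar per layer.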

updated: Sat Mar 11 2023 03:30:47 GMT+0000 (UTC)

published: Sat Mar 11 2023 03:30:47 GMT+0000 (UTC)

arXiv
