Stabilizing Transformer Training by Preventing Attention Entropy Collapse

Shuangfei Zhai; Tatiana Likhomanenko; Etai Littwin; Dan Busbridge; Jason Ramapuram; Yizhe Zhang; Jiatao Gu; Josh Susskind

Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy of each attention head over the course of training, which serves as a proxy for model sharpness. We identify a common pattern across different architectures and tasks: low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or divergence. We call pathologically low attention entropy, corresponding to highly concentrated attention scores, entropy collapse. As a remedy, we propose σReparam, a simple and efficient solution in which we reparametrize all linear layers with spectral normalization and an additional learned scalar. We demonstrate that σReparam successfully prevents entropy collapse in the attention layers, promoting more stable training. Additionally, we prove a tight lower bound on the attention entropy, which decreases exponentially fast with the spectral norm of the attention logits, providing additional motivation for our approach. We conduct experiments with σReparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks. We show that σReparam provides stability and robustness with respect to the choice of hyperparameters, going so far as to enable training of (a) a Vision Transformer to competitive performance without warmup, weight decay, layer normalization, or adaptive optimizers; (b) deep architectures in machine translation; and (c) speech recognition models to competitive performance without warmup or adaptive optimizers. Code is available at https://github.com/apple/ml-sigma-reparam.
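
The abstract gives no formulas, so the following is only a minimal, dependency-free sketch of the two quantities it names. It assumes that attention entropy is the Shannon entropy of each softmax-normalized row of attention logits, averaged over queries, and that the reparametrization rescales a weight matrix W to (gamma / sigma(W)) * W, where sigma(W) is the spectral norm and gamma is the learned scalar. The function names (attention_entropy, spectral_norm, sigma_reparam) are hypothetical and are not taken from the authors' repository; gamma is a plain float here, whereas in training it would be a learned parameter.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(logits):
    """Mean Shannon entropy (in nats) of the row-wise softmax of `logits`.

    `logits` is a list of rows, one row of attention logits per query.
    Assumption: this is the entropy the abstract refers to.
    """
    entropies = []
    for row in logits:
        probs = softmax(row)
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0.0))
    return sum(entropies) / len(entropies)


def _normalize(vec):
    """Scale `vec` to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else vec


def spectral_norm(weight, n_iters=50):
    """Estimate the largest singular value of `weight` by power iteration."""
    rows, cols = len(weight), len(weight[0])
    # Fixed, deterministic start vector (a random one is more common in practice).
    v = _normalize([1.0 + 0.01 * j for j in range(cols)])
    for _ in range(n_iters):
        u = _normalize([sum(weight[i][j] * v[j] for j in range(cols)) for i in range(rows)])
        v = _normalize([sum(weight[i][j] * u[i] for i in range(rows)) for j in range(cols)])
    wv = [sum(weight[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in wv))


def sigma_reparam(weight, gamma=1.0):
    """Return (gamma / sigma(W)) * W, so the result has spectral norm close to gamma.

    Assumption: this is the form of the reparametrization; `gamma` stands in
    for the learned scalar.
    """
    sigma = spectral_norm(weight)
    if sigma == 0.0:
        raise ValueError("cannot reparametrize an all-zero weight matrix")
    scale = gamma / sigma
    return [[scale * w for w in row] for row in weight]


if __name__ == "__main__":
    uniform = [[0.0, 0.0, 0.0, 0.0]]    # attention spread evenly over 4 keys
    peaked = [[100.0, 0.0, 0.0, 0.0]]   # attention concentrated on one key
    print("entropy (uniform):", attention_entropy(uniform))
    print("entropy (peaked): ", attention_entropy(peaked))
    w_hat = sigma_reparam([[3.0, 0.0], [0.0, 1.0]], gamma=2.0)
    print("spectral norm after reparam:", spectral_norm(w_hat))
```

Under these assumptions, uniform logits over four keys give the maximum entropy ln 4, while a single dominant logit drives the entropy toward zero, which is the regime the abstract calls entropy collapse. Rescaling by the spectral norm bounds the size of the weights that produce the logits, with gamma controlling how large that bound is.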

updated: Tue Jul 25 2023 17:42:37 GMT+0000 (UTC)

published: Sat Mar 11 2023 03:30:47 GMT+0000 (UTC)

arXiv
