LipsFormer: Introducing Lipschitz Continuity to Vision Transformers

Xianbiao Qi; Jianan Wang; Yihao Chen; Yukai Shi; Lei Zhang

We present a Lipschitz continuous Transformer, called LipsFormer, to pursue training stability both theoretically and empirically for Transformer-based models. In contrast to previous practical tricks that address training instability through learning rate warmup, layer normalization, attention formulation, and weight initialization, we show that Lipschitz continuity is a more essential property for ensuring training stability. In LipsFormer, we replace unstable Transformer component modules with Lipschitz continuous counterparts: CenterNorm instead of LayerNorm, spectral initialization instead of Xavier initialization, scaled cosine similarity attention instead of dot-product attention, and a weighted residual shortcut. We prove that these introduced modules are Lipschitz continuous and derive an upper bound on the Lipschitz constant of LipsFormer. Our experiments show that LipsFormer allows stable training of deep Transformer architectures without the need for careful learning rate tuning such as warmup, yielding faster convergence and better generalization. As a result, on the ImageNet 1K dataset, LipsFormer-Swin-Tiny, based on Swin Transformer and trained for 300 epochs, obtains 82.7% without any learning rate warmup. Moreover, LipsFormer-CSwin-Tiny, based on CSwin and trained for 300 epochs, achieves a top-1 accuracy of 83.5% with 4.7G FLOPs and 24M parameters. The code will be released at https://github.com/IDEA-Research/LipsFormer.
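To make the component swaps named in the abstract more concrete, here is a minimal pure-Python sketch of three of them. It is an illustration, not the authors' implementation. The function names, the `eps` values, the temperature `tau`, and the residual weight `alpha` are all assumptions made for this example. CenterNorm is assumed here to mean mean-centering without division by the standard deviation, and any affine or scaling terms the paper may use are left out. Spectral initialization is not shown.

```python
import math

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm without affine parameters: center, then divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def center_norm(x):
    """Assumed CenterNorm: subtract the mean only (no division by the std)."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def l2_normalize(v, eps=1e-6):
    """Scale a vector to (almost) unit length; eps avoids division by zero."""
    norm = math.sqrt(sum(c * c for c in v) + eps)
    return [c / norm for c in v]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_cosine_attention(queries, keys, values, tau=1.0):
    """Single-head attention whose logits are tau * cosine(query, key).

    Because cosine similarity lies in [-1, 1], every logit lies in
    [-tau, tau] regardless of how large the raw queries and keys are.
    """
    q_hat = [l2_normalize(q) for q in queries]
    k_hat = [l2_normalize(k) for k in keys]
    dim = len(values[0])
    outputs = []
    for q in q_hat:
        logits = [tau * sum(a * b for a, b in zip(q, k)) for k in k_hat]
        weights = softmax(logits)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return outputs

def weighted_residual(x, branch, alpha=0.1):
    """Residual shortcut x + alpha * branch, with a hypothetical weight alpha."""
    return [xi + alpha * bi for xi, bi in zip(x, branch)]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Two nearly identical, nearly constant inputs: the variance is close to zero,
# which is where dividing by the standard deviation amplifies small changes.
x1 = [1.0, 1.0, 1.0, 1.0]
x2 = [1.0, 1.0, 1.0, 1.0 + 1e-4]
input_gap = distance(x1, x2)

# Ratio of output change to input change (an empirical local Lipschitz estimate).
layer_norm_ratio = distance(layer_norm(x1), layer_norm(x2)) / input_gap
center_norm_ratio = distance(center_norm(x1), center_norm(x2)) / input_gap

print("LayerNorm  output/input change:", layer_norm_ratio)
print("CenterNorm output/input change:", center_norm_ratio)
```

In this sketch the LayerNorm ratio is large (on the order of `1 / sqrt(eps)`) for near-constant inputs, while the centering-only ratio never exceeds 1, because mean subtraction is an orthogonal projection. The attention logits are likewise bounded by `tau` however large the raw queries are, which is the kind of boundedness a Lipschitz argument relies on.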

updated: Wed Apr 19 2023 17:59:39 GMT+0000 (UTC)

published: Wed Apr 19 2023 17:59:39 GMT+0000 (UTC)

arXiv
