Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice

Peihao Wang; Wenqing Zheng; Tianlong Chen; Zhangyang Wang

Vision Transformer (ViT) has recently demonstrated promise in computer vision problems. However, unlike Convolutional Neural Networks (CNNs), the performance of ViT is known to saturate quickly as depth increases, due to the observed attention collapse or patch uniformity. Despite a couple of empirical solutions, a rigorous framework for studying this scalability issue remains elusive. In this paper, we first establish a rigorous theoretical framework to analyze ViT features in the Fourier spectrum domain. We show that the self-attention mechanism inherently amounts to a low-pass filter, which implies that as a ViT scales up its depth, excessive low-pass filtering causes feature maps to preserve only their Direct-Current (DC) component. We then propose two straightforward yet effective techniques to mitigate this undesirable low-pass limitation. The first technique, termed AttnScale, decomposes a self-attention block into low-pass and high-pass components, then rescales and combines these two filters to produce an all-pass self-attention matrix. The second technique, termed FeatScale, re-weights feature maps on separate frequency bands to amplify the high-frequency signals. Both techniques are efficient and hyperparameter-free, and they effectively overcome relevant ViT training artifacts such as attention collapse and patch uniformity. By seamlessly plugging our techniques into multiple ViT variants, we demonstrate that they consistently help ViTs benefit from deeper architectures, bringing up to 1.1% performance gains "for free" (e.g., with little parameter overhead). We publicly release our code and pre-trained models at https://github.com/VITA-Group/ViT-Anti-Oversmoothing.
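
The low-pass behavior and the two rescaling ideas described in the abstract can be illustrated with a small NumPy sketch. This is only one plausible reading of the abstract, not the authors' implementation: it assumes the DC component is the mean over tokens, the low-pass part of an attention matrix is the uniform averaging matrix, and the function names (attn_scale, feat_scale, high_freq_ratio) and the fixed values of omega, s and t are hypothetical stand-ins for whatever the released code actually learns. The repository linked above remains the reference for the real formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def split_dc_hc(x):
    """Split features of shape (n_tokens, dim) into a DC part (token mean,
    broadcast to every token) and the high-frequency remainder."""
    dc = np.broadcast_to(x.mean(axis=0, keepdims=True), x.shape)
    return dc, x - dc

def attn_scale(attn, omega):
    """Assumed AttnScale-style rescaling: keep the low-pass part (uniform
    averaging) and amplify the high-pass remainder by (1 + omega)."""
    n = attn.shape[-1]
    low = np.full((n, n), 1.0 / n)
    high = attn - low
    return low + (1.0 + omega) * high

def feat_scale(x, s, t):
    """Assumed FeatScale-style re-weighting: add per-channel scaled copies
    of the DC and high-frequency bands back onto the features."""
    dc, hc = split_dc_hc(x)
    return x + s * dc + t * hc

def high_freq_ratio(x):
    """Share of the feature norm that lives outside the DC component."""
    _, hc = split_dc_hc(x)
    return np.linalg.norm(hc) / np.linalg.norm(x)

rng = np.random.default_rng(0)
n_tokens, dim, depth = 16, 8, 12
x0 = rng.normal(size=(n_tokens, dim))
# A random row-stochastic matrix standing in for a softmax attention map.
attn = softmax(rng.normal(size=(n_tokens, n_tokens)))
attn_rescaled = attn_scale(attn, omega=0.5)  # omega is a hypothetical value

# Apply the same attention map repeatedly, with and without rescaling.
x_plain, x_scaled = x0.copy(), x0.copy()
for _ in range(depth):
    x_plain = attn @ x_plain
    x_scaled = attn_rescaled @ x_scaled

# Re-weight the over-smoothed features once (s = 0, t = 1 are hypothetical).
x_feat = feat_scale(x_plain, s=np.zeros(dim), t=np.ones(dim))

ratio_start = high_freq_ratio(x0)
ratio_plain = high_freq_ratio(x_plain)
ratio_scaled = high_freq_ratio(x_scaled)
ratio_feat = high_freq_ratio(x_feat)
print(ratio_start, ratio_plain, ratio_scaled, ratio_feat)
```

In this toy setting, stacking the plain attention map drives the high-frequency share of the features toward zero, while both the rescaled attention map and the re-weighted features retain a larger share. It uses one fixed attention matrix and no MLP or residual branches, so it shows the filtering effect in isolation rather than the behavior of a full ViT block.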

updated: Wed Mar 09 2022 23:55:24 GMT+0000 (UTC)

published: Wed Mar 09 2022 23:55:24 GMT+0000 (UTC)

arXiv
