Improve Vision Transformers Training by Suppressing Over-smoothing

Chengyue Gong; Dilin Wang; Meng Li; Vikas Chandra; Qiang Liu

Introducing the transformer structure into computer vision tasks holds the promise of a better speed-accuracy trade-off than traditional convolutional networks. However, directly training vanilla transformers on vision tasks has been shown to yield unstable and sub-optimal results. As a result, recent works propose modifying transformer structures by incorporating convolutional layers to improve performance on vision tasks. This work investigates how to stabilize the training of vision transformers without special structural modification. We observe that the instability of transformer training on vision tasks can be attributed to the over-smoothing problem: self-attention layers tend to map different patches of the input image to similar latent representations, which causes loss of information and degraded performance, especially when the number of layers is large. We then propose a number of techniques to alleviate this problem, including additional loss functions that encourage diversity, prevent loss of information, and discriminate between different patches through an additional patch classification loss for CutMix. We show that our proposed techniques stabilize training and allow us to train wider and deeper vision transformers, achieving 85.0% top-1 accuracy on the ImageNet validation set without introducing extra teachers or additional convolution layers. Our code will be made publicly available at https://github.com/ChengyueGongR/PatchVisionTransformer .
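
To make the over-smoothing idea concrete, the sketch below measures how similar a set of patch representations are to one another using mean pairwise cosine similarity. This is an illustrative assumption, not the authors' loss functions or code: the names `mean_pairwise_cosine` and `smooth` are hypothetical, and `smooth` is only a toy stand-in for repeated attention-style averaging, not a real self-attention layer.

```python
import math
from itertools import combinations


def mean_pairwise_cosine(patches):
    """Average cosine similarity over all distinct pairs of patch vectors.

    `patches` is a list of equal-length lists of floats, one per image patch.
    Values near 1.0 mean the representations have collapsed toward each
    other (over-smoothing); lower values mean they remain distinct.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    pairs = list(combinations(patches, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def smooth(patches, weight=0.5):
    """Toy assumption: pull every patch toward the mean of all patches."""
    dim = len(patches[0])
    mean = [sum(p[i] for p in patches) / len(patches) for i in range(dim)]
    return [[(1 - weight) * p[i] + weight * mean[i] for i in range(dim)]
            for p in patches]


# Three mutually orthogonal "patch" vectors: maximally distinct to start.
patches = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
before = mean_pairwise_cosine(patches)

# Apply the toy smoothing step four times, mimicking a deeper stack.
for _ in range(4):
    patches = smooth(patches)
after = mean_pairwise_cosine(patches)

print(f"similarity before: {before:.3f}, after: {after:.3f}")
```

In this toy setting the similarity rises from 0 toward 1 as more smoothing steps are stacked, which mirrors the abstract's claim that the problem worsens with depth. A diversity-encouraging loss, as described in general terms above, would penalize a high value of a similarity measure like this one; the exact form used in the paper is not given here.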

updated: Mon Apr 26 2021 17:43:04 GMT+0000 (UTC)

published: Mon Apr 26 2021 17:43:04 GMT+0000 (UTC)

arXiv
