DeepViT: Towards Deeper Vision Transformer

Daquan Zhou; Bingyi Kang; Xiaojie Jin; Linjie Yang; Xiaochen Lian; Zihang Jiang; Qibin Hou; Jiashi Feng

Vision transformers (ViTs) have recently been applied successfully to image classification tasks. In this paper, we show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled to be deeper. More specifically, we empirically observe that this scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and, after certain layers, nearly identical. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This fact demonstrates that in deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and hinders the model from getting the expected performance gain. Based on the above observation, we propose a simple yet effective method, named Re-attention, that re-generates the attention maps to increase their diversity at different layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modifications to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy can be improved by 1.6% on ImageNet. Code is publicly available at https://github.com/zhoudaquan/dvit_repo.
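
The two ideas in the abstract, measuring how similar attention maps are and regenerating them to restore diversity, can be illustrated with a small pure-Python sketch. It is not the authors' implementation. The abstract does not spell out how Re-attention regenerates the maps, so the sketch assumes one plausible form: each new attention map is a weighted mix of the per-head maps, using a small head-by-head matrix that would be learned in practice. The names `cosine_similarity`, `mix_heads`, and `theta`, the choice of cosine similarity as the metric, and the row renormalization step are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_similarity(map_a, map_b):
    """Cosine similarity between two attention maps, flattened to vectors.

    Values close to 1 mean the two maps are nearly the same, which is the
    symptom the abstract calls attention collapse.
    """
    a = [x for row in map_a for x in row]
    b = [x for row in map_b for x in row]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mix_heads(head_maps, theta):
    """Regenerate attention maps by mixing heads (assumed form, see lead-in).

    head_maps: list of H attention maps, each a T x T list of lists.
    theta:     H x H list of lists; theta[g][h] is the weight of input head g
               in output head h. Hypothetical parameter, assumed non-negative
               here so that rows can be renormalized to sum to 1.
    """
    num_heads = len(head_maps)
    tokens = len(head_maps[0])
    mixed = []
    for h in range(num_heads):
        new_map = []
        for i in range(tokens):
            row = [
                sum(theta[g][h] * head_maps[g][i][j] for g in range(num_heads))
                for j in range(tokens)
            ]
            total = sum(row)
            # Stand-in normalization so each row is again a distribution.
            new_map.append([v / total for v in row])
        mixed.append(new_map)
    return mixed


# Toy input: two heads over two tokens.
head_sharp = [softmax([2.0, 0.0]), softmax([0.0, 2.0])]    # mostly attends to itself
head_uniform = [softmax([0.0, 0.0]), softmax([0.0, 0.0])]  # attends uniformly
heads = [head_sharp, head_uniform]

# Example mixing weights; in a real model these would be learned.
theta = [[0.8, 0.3],
         [0.2, 0.7]]
regenerated = mix_heads(heads, theta)

print("similarity before mixing:", cosine_similarity(heads[0], heads[1]))
print("similarity after mixing: ", cosine_similarity(regenerated[0], regenerated[1]))
```

The same `cosine_similarity` helper could be applied to maps taken from adjacent transformer blocks to check whether they are collapsing. The toy weights here only show the mechanics of the mix; whether mixing raises or lowers similarity depends entirely on the values of `theta`.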

updated: Mon Apr 19 2021 07:06:02 GMT+0000 (UTC)

published: Mon Mar 22 2021 14:32:07 GMT+0000 (UTC)

arXiv
