Rethinking Spatial Dimensions of Vision Transformers

Byeongho Heo; Sangdoo Yun; Dongyoon Han; Sanghyuk Chun; Junsuk Choe; Seong Joon Oh

Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks, offering an alternative architecture to existing convolutional neural networks (CNNs). Because the transformer-based architecture is a recent innovation in computer vision modeling, design conventions for an effective architecture have not yet been well studied. Drawing on the successful design principles of CNNs, we investigate the role of spatial dimension conversion and its effectiveness in transformer-based architectures. We pay particular attention to the dimension reduction principle of CNNs: as depth increases, a conventional CNN increases the channel dimension and decreases the spatial dimensions. We empirically show that such spatial dimension reduction benefits a transformer architecture as well, and propose a novel Pooling-based Vision Transformer (PiT) built on the original ViT model. We show that PiT achieves improved model capability and generalization performance over ViT. Through extensive experiments, we further show that PiT outperforms the baseline on several tasks, such as image classification, object detection, and robustness evaluation. Source code and ImageNet models are available at https://github.com/naver-ai/pit
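The dimension reduction principle described in the abstract can be illustrated with a small shape calculation. The following is a minimal sketch, not the authors' implementation: the function name `stage_shapes`, the pooling stride of 2, the channel multiplier of 2, and the example sizes are all assumptions chosen for illustration, since the abstract does not specify them.

```python
def stage_shapes(image_size, patch_size, base_channels, num_stages,
                 pool_stride=2, channel_mult=2):
    """Return (height, width, channels) of the token grid at each stage.

    Hypothetical helper: pool_stride and channel_mult are illustrative
    assumptions, not values taken from the abstract.
    """
    side = image_size // patch_size      # tokens per side after patch embedding
    channels = base_channels
    shapes = []
    for _ in range(num_stages):
        shapes.append((side, side, channels))
        # Between stages: shrink the spatial grid, widen the channels.
        side = max(1, side // pool_stride)
        channels *= channel_mult
    return shapes


# CNN-like schedule with pooling: spatial size shrinks, channels grow.
pooled = stage_shapes(224, 16, 64, 3)

# ViT-like schedule: no pooling, so every stage has the same shape.
constant = stage_shapes(224, 16, 64, 3, pool_stride=1, channel_mult=1)

for name, shapes in (("pooled", pooled), ("constant", constant)):
    for h, w, c in shapes:
        print(f"{name}: {h}x{w} tokens, {c} channels")
```

With these assumed settings, the pooled schedule moves from a 14x14 grid to 7x7 and then 3x3 while the channel count doubles at each step, whereas the constant schedule keeps the same token grid and width throughout, which is the contrast the abstract draws between conventional CNNs and the original ViT.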

updated: Wed Aug 18 2021 03:47:24 GMT+0000 (UTC)

published: Tue Mar 30 2021 12:51:28 GMT+0000 (UTC)

arXiv
