So-ViT: Mind Visual Tokens for Vision Transformer

Jiangtao Xie; Ruiren Zeng; Qilong Wang; Ziqi Zhou; Peihua Li

Recently, the vision transformer (ViT) architecture, whose backbone consists purely of self-attention mechanisms, has achieved very promising performance in visual classification. However, the high performance of the original ViT depends heavily on pretraining with ultra-large-scale datasets, and it significantly underperforms on ImageNet-1K when trained from scratch. This paper works toward addressing this problem by carefully considering the role of visual tokens. First, for the classification head, existing ViTs exploit only the class token while entirely neglecting the rich semantic information inherent in high-level visual tokens. Therefore, we propose a new classification paradigm in which second-order, cross-covariance pooling of the visual tokens is combined with the class token for final classification. In addition, a fast singular value power normalization is proposed to improve the second-order pooling. Second, the original ViT employs a naive embedding of fixed-size image patches, which lacks the ability to model translation equivariance and locality. To alleviate this problem, we develop a lightweight, hierarchical module based on off-the-shelf convolutions for visual token embedding. The proposed architecture, which we call So-ViT, is thoroughly evaluated on ImageNet-1K. The results show that our models, when trained from scratch, outperform the competing ViT variants while being on par with or better than state-of-the-art CNN models. Code is available at https://github.com/jiangtaoxie/So-ViT.
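
As a rough illustration of the classification idea described in the abstract, the NumPy sketch below computes a cross-covariance matrix from a set of visual tokens and then raises its singular values to a power. It is not taken from the authors' repository. The two projection matrices `w_a` and `w_b`, the token and projection sizes, the exponent `alpha=0.5`, and the use of a full SVD (rather than the fast normalization the paper proposes) are all assumptions made for illustration only.

```python
import numpy as np


def cross_covariance_pool(tokens, w_a, w_b):
    """Second-order pooling sketch (hypothetical helper, not the authors' code).

    tokens: (n, d) visual tokens; w_a: (d, p) and w_b: (d, q) are assumed
    linear projections. Returns the (p, q) cross-covariance matrix.
    """
    a = tokens @ w_a
    b = tokens @ w_b
    # Center each projected feature over the token dimension.
    a = a - a.mean(axis=0, keepdims=True)
    b = b - b.mean(axis=0, keepdims=True)
    return a.T @ b / tokens.shape[0]


def singular_value_power_norm(mat, alpha=0.5):
    """Raise the singular values of mat to the power alpha.

    Uses an exact SVD for clarity; the paper's fast variant is not
    reproduced here.
    """
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    # Scale the columns of u by s**alpha, then recompose the matrix.
    return (u * s**alpha) @ vt


rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))   # assumed: 196 tokens of width 64
w_a = rng.normal(size=(64, 8))        # assumed projection to 8 dims
w_b = rng.normal(size=(64, 8))        # assumed projection to 8 dims

pooled = cross_covariance_pool(tokens, w_a, w_b)
normalized = singular_value_power_norm(pooled)
feature = normalized.reshape(-1)      # flattened second-order feature
```

In a full model, a vector such as `feature` would be combined with the class token before the final classifier; how the two are combined is not specified in the abstract and is left out of this sketch.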

updated: Thu Apr 22 2021 09:05:09 GMT+0000 (UTC)

published: Thu Apr 22 2021 09:05:09 GMT+0000 (UTC)

arXiv
