Scalable Vision Transformers with Hierarchical Pooling

Zizheng Pan; Bohan Zhuang; Jing Liu; Haoyu He; Jianfei Cai

The recently proposed Visual image Transformers (ViT) with pure attention have achieved promising performance on image recognition tasks such as image classification. However, current ViT models maintain a full-length patch sequence throughout inference, which is redundant and lacks hierarchical representation. To address this, we propose a Hierarchical Visual Transformer (HVT) that progressively pools visual tokens to shrink the sequence length and hence reduce the computational cost, analogous to feature map downsampling in Convolutional Neural Networks (CNNs). A major benefit is that, because the sequence length is reduced, we can increase model capacity by scaling the depth, width, resolution, or patch size without introducing extra computational complexity. Moreover, we empirically find that the average-pooled visual tokens contain more discriminative information than the single class token. To demonstrate the improved scalability of HVT, we conduct extensive experiments on image classification. With comparable FLOPs, HVT outperforms competitive baselines on the ImageNet and CIFAR-100 datasets. Code is available at https://github.com/MonashAI/HVT
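The pooling idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation (see the linked repository for that). The function names, the choice of max pooling between stages, the kernel and stride of 2, and the per-block cost estimate (the commonly used multiply-accumulate count for one Transformer block with an MLP expansion ratio of 4) are all assumptions made for illustration.

```python
from typing import List

Token = List[float]  # one visual token = one embedding vector


def max_pool_tokens(tokens: List[Token], kernel: int = 2, stride: int = 2) -> List[Token]:
    """Shrink the sequence by taking a per-channel max over sliding windows.

    Assumption: max pooling without padding; the abstract only says "pools".
    """
    pooled = []
    for start in range(0, len(tokens) - kernel + 1, stride):
        window = tokens[start:start + kernel]
        pooled.append([max(channel) for channel in zip(*window)])
    return pooled


def average_pool_tokens(tokens: List[Token]) -> Token:
    """Average all tokens into one vector, used here in place of a class token."""
    n = len(tokens)
    return [sum(channel) / n for channel in zip(*tokens)]


def stage_lengths(seq_len: int, num_stages: int, kernel: int = 2, stride: int = 2) -> List[int]:
    """Sequence length at each stage when pooling is applied between stages."""
    lengths = [seq_len]
    for _ in range(num_stages - 1):
        seq_len = (seq_len - kernel) // stride + 1
        lengths.append(seq_len)
    return lengths


def block_macs(seq_len: int, dim: int, mlp_ratio: int = 4) -> int:
    """Rough multiply-accumulate count for one Transformer block (assumed formula)."""
    attention = 4 * seq_len * dim ** 2 + 2 * seq_len ** 2 * dim
    mlp = 2 * mlp_ratio * seq_len * dim ** 2
    return attention + mlp


if __name__ == "__main__":
    tokens = [[1.0, 5.0], [3.0, 2.0], [0.0, 0.0], [4.0, -1.0]]
    pooled = max_pool_tokens(tokens)
    print(pooled)
    print(average_pool_tokens(pooled))
    for n in stage_lengths(196, 3):
        print(n, block_macs(n, dim=384))
```

Under these assumptions, a 196-token sequence (a 224x224 image with 16x16 patches) shrinks to 98 and then 49 tokens over three stages, and the estimated per-block cost falls with it. That freed budget is what the abstract proposes spending on extra depth, width, resolution, or smaller patches.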

updated: Wed Aug 18 2021 10:18:22 GMT+0000 (UTC)

published: Fri Mar 19 2021 03:55:58 GMT+0000 (UTC)

arXiv
