Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Li Yuan; Yunpeng Chen; Tao Wang; Weihao Yu; Yujun Shi; Zihang Jiang; Francis EH Tay; Jiashi Feng; Shuicheng Yan

Transformers, which are popular for language modeling, have recently been explored for vision tasks, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a fixed-length sequence of tokens and then applies multiple Transformer layers to model their global relations for classification. However, ViT performs worse than CNNs when trained from scratch on a midsize dataset like ImageNet. We find this is because: 1) the simple tokenization of input images fails to model important local structure, such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness under fixed computation budgets and limited training samples. To overcome these limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation that progressively structures the image into tokens by recursively aggregating neighboring tokens into one token (Tokens-to-Token), so that the local structure represented by surrounding tokens can be modeled and the token length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN architecture design after an empirical study. Notably, T2T-ViT halves the parameter count and MACs of vanilla ViT while achieving more than 3.0% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves performance comparable to MobileNets when trained directly on ImageNet. For example, T2T-ViT with a size comparable to ResNet50 (21.5M parameters) can achieve 83.3% top-1 accuracy at an image resolution of 384×384 on ImageNet. (Code: https://github.com/yitu-opensource/T2T-ViT)
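The aggregation step described in the abstract (neighboring tokens merged into one token, with the token length shrinking as a result) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `soft_split`, the toy 8×8 grid, and the kernel, stride, and padding values are assumptions chosen for illustration, and the Transformer layers applied between aggregation steps are omitted.

```python
import numpy as np


def soft_split(feature_map, kernel, stride, padding):
    """Merge each kernel x kernel neighborhood of tokens into one token.

    feature_map: array of shape (H, W, C), tokens laid out on a 2D grid.
    Returns (tokens, (out_h, out_w)), where tokens has shape
    (out_h * out_w, kernel * kernel * C).
    Hypothetical helper for illustration only.
    """
    h, w, c = feature_map.shape
    # Zero-pad the grid so border tokens also get full neighborhoods.
    padded = np.pad(feature_map, ((padding, padding), (padding, padding), (0, 0)))
    out_h = (h + 2 * padding - kernel) // stride + 1
    out_w = (w + 2 * padding - kernel) // stride + 1
    tokens = []
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = padded[top:top + kernel, left:left + kernel, :]
            # Concatenate the neighboring tokens into one longer token.
            tokens.append(patch.reshape(-1))
    return np.stack(tokens), (out_h, out_w)


# Toy input (assumed values): an 8x8 grid of 3-dimensional tokens.
grid = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
tokens, (out_h, out_w) = soft_split(grid, kernel=3, stride=2, padding=1)
print(tokens.shape, out_h, out_w)  # (16, 27) 4 4
```

With a kernel larger than the stride, adjacent neighborhoods overlap, so each output token shares information with its neighbors while the sequence shrinks from 64 tokens to 16. Applying the function again to the output, reshaped back to a `(out_h, out_w, -1)` grid, would give the recursive behavior the abstract describes.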

updated: Mon Mar 22 2021 11:58:10 GMT+0000 (UTC)

published: Thu Jan 28 2021 13:25:28 GMT+0000 (UTC)

arXiv
