Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet

Zihang Jiang; Qibin Hou; Li Yuan; Daquan Zhou; Xiaojie Jin; Anran Wang; Jiashi Feng

This paper provides a strong baseline for vision transformers on the ImageNet classification task. While recent vision transformers have demonstrated promising results in ImageNet classification, their performance still lags behind powerful convolutional neural networks (CNNs) of approximately the same model size. In this work, instead of describing a novel transformer architecture, we explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques. We show that by slightly tuning the structure of vision transformers and introducing token labeling, a new training objective, our models achieve better results than their CNN counterparts and other transformer-based classification models with a similar number of training parameters and a similar amount of computation. Taking a vision transformer with 26M learnable parameters as an example, we achieve an 84.4% Top-1 accuracy on ImageNet. When the model size is scaled up to 56M/150M, the result increases further to 85.4%/86.2% without extra data. We hope this study provides researchers with useful techniques for training powerful vision transformers. Our code and all the training details will be made publicly available at https://github.com/zihangJiang/TokenLabeling.

updated: Thu Apr 22 2021 04:43:06 GMT+0000 (UTC)

published: Thu Apr 22 2021 04:43:06 GMT+0000 (UTC)

arXiv
