All Tokens Matter: Token Labeling for Training Better Vision Transformers

Zihang Jiang; Qibin Hou; Li Yuan; Daquan Zhou; Yujun Shi; Xiaojie Jin; Anran Wang; Jiashi Feng

In this paper, we present token labeling, a new training objective for high-performance vision transformers (ViTs). Unlike the standard ViT training objective, which computes the classification loss on an additional trainable class token, our objective uses all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem as multiple token-level recognition problems and assigns each patch token its own location-specific supervision, generated by a machine annotator. Experiments show that token labeling clearly and consistently improves the performance of various ViT models across a wide spectrum. For example, a vision transformer with 26M learnable parameters trained with token labeling achieves 84.4% Top-1 accuracy on ImageNet. Scaling the model up to 150M parameters raises the result to 86.4%, making it the smallest model to reach 86%; previous models at that level have 250M+ parameters. We also show that token labeling clearly improves the generalization capability of the pre-trained models on downstream dense prediction tasks, such as semantic segmentation. Our code and all the training details will be made publicly available at https://github.com/zihangJiang/TokenLabeling.
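
To make the dense objective concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the total loss is the usual class-token cross-entropy plus a weighted average of per-patch cross-entropies against soft, location-specific targets. The function names, the `aux_weight` parameter, and its default value are hypothetical choices made for illustration; the abstract does not specify how the two terms are combined.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    if len(logits) != len(target):
        raise ValueError("logits and target must have the same length")
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))


def token_labeling_loss(cls_logits, patch_logits, image_target,
                        token_targets, aux_weight=0.5):
    """Class-token loss plus a dense loss averaged over all patch tokens.

    cls_logits:    class scores predicted from the class token
    patch_logits:  one list of class scores per patch token
    image_target:  image-level label as a distribution over classes
    token_targets: one distribution per patch (the machine annotator's output)
    aux_weight:    hypothetical weight on the dense term (an assumption)
    """
    if not patch_logits or len(patch_logits) != len(token_targets):
        raise ValueError("need one target per patch token")
    cls_loss = soft_cross_entropy(cls_logits, image_target)
    dense_loss = sum(
        soft_cross_entropy(logits, target)
        for logits, target in zip(patch_logits, token_targets)
    ) / len(patch_logits)
    return cls_loss + aux_weight * dense_loss


# Toy example: 2 classes, 2 patch tokens, untrained (uniform) predictions.
loss = token_labeling_loss(
    cls_logits=[0.0, 0.0],
    patch_logits=[[0.0, 0.0], [0.0, 0.0]],
    image_target=[1.0, 0.0],
    token_targets=[[1.0, 0.0], [0.0, 1.0]],
)
print(round(loss, 4))
```

Setting `aux_weight=0.0` in this sketch recovers the standard class-token-only objective, which makes the contrast described in the abstract easy to see. A practical version would use a tensor library and batched inputs.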

updated: Wed Jun 09 2021 15:27:26 GMT+0000 (UTC)

published: Thu Apr 22 2021 04:43:06 GMT+0000 (UTC)

arXiv
