How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

Andreas Steiner; Alexander Kolesnikov; Xiaohua Zhai; Ross Wightman; Jakob Uszkoreit; Lucas Beyer

Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection, and semantic image segmentation. Compared with convolutional neural networks, the Vision Transformer's weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation ("AugReg" for short) when training on smaller datasets. We conduct a systematic empirical study to better understand the interplay between the amount of training data, AugReg, model size, and compute budget. One result of this study is that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data: we train ViT models of various sizes on the public ImageNet-21k dataset that either match or outperform their counterparts trained on the larger, but not publicly available, JFT-300M dataset.

updated: Fri Jun 18 2021 17:58:20 GMT+0000 (UTC)

published: Fri Jun 18 2021 17:58:20 GMT+0000 (UTC)

arXiv
