How to Train Vision Transformer on Small-scale Datasets?

Hanan Gani; Muzammal Naseer; Mohammad Yaqub

Vision Transformer (ViT), an architecture radically different from convolutional neural networks, offers multiple advantages including design simplicity, robustness, and state-of-the-art performance on many vision tasks. However, in contrast to convolutional neural networks, Vision Transformer lacks inherent inductive biases. Therefore, successful training of such models is mainly attributed to pre-training on large-scale datasets such as ImageNet with 1.2M images or JFT with 300M images. This hinders the direct adaptation of Vision Transformer to small-scale datasets. In this work, we show that self-supervised inductive biases can be learned directly from small-scale datasets and serve as an effective weight initialization scheme for fine-tuning. This allows these models to be trained without large-scale pre-training or changes to the model architecture or loss functions. We present thorough experiments to successfully train monolithic and non-monolithic Vision Transformers on five small datasets (CIFAR10/100, CINIC10, SVHN, and Tiny-ImageNet) and two fine-grained datasets (Aircraft and Cars). Our approach consistently improves the performance of Vision Transformers while retaining their properties such as attention to salient regions and higher robustness. Our code and pre-trained models are available at: https://github.com/hananshafi/vits-for-small-scale-datasets.

updated: Thu Oct 13 2022 17:59:19 GMT+0000 (UTC)

published: Thu Oct 13 2022 17:59:19 GMT+0000 (UTC)

arXiv
