Scaling Vision Transformers

Xiaohua Zhai; Alexander Kolesnikov; Neil Houlsby; Lucas Beyer

Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in attaining excellent results; therefore, understanding a model's scaling properties is key to designing future generations effectively. While the laws for scaling Transformer language models have been studied, it is unknown how Vision Transformers scale. To address this, we scale ViT models and data, both up and down, and characterize the relationships between error rate, data, and compute. Along the way, we refine the architecture and training of ViT, reducing memory consumption and increasing the accuracy of the resulting models. As a result, we successfully train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy. The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.

updated: Tue Jun 08 2021 17:47:39 GMT+0000 (UTC)

published: Tue Jun 08 2021 17:47:39 GMT+0000 (UTC)

arXiv

