When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations

Xiangning Chen; Cho-Jui Hsieh; Boqing Gong

Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features or inductive biases with general-purpose neural architectures. Existing works empower these models with massive data, such as large-scale pretraining and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rate). Hence, this paper investigates ViTs and MLP-Mixers through the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference. Visualization and Hessian analysis reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with the simple Inception-style preprocessing). We show that the improved smoothness is attributable to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations. They also possess more perceptive attention maps.
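
The abstract does not spell out how the sharpness-aware optimizer works. As a rough illustration only, the sketch below shows one common form of a sharpness-aware update: take the gradient, step a small distance rho along its normalized direction to a nearby "worst-case" point, then apply the gradient computed there to the original weights. The toy quadratic loss, the names `sam_step`, `curv`, `lr`, and `rho`, and all numeric values are assumptions made for this example and are not taken from the paper.

```python
import math

# Toy loss (an assumption for illustration): L(w) = 0.5 * sum(c_i * w_i^2).
# A large c_i gives a sharp direction, a small c_i gives a flat one.
def loss(w, curv):
    return 0.5 * sum(c * x * x for c, x in zip(curv, w))

def grad(w, curv):
    return [c * x for c, x in zip(curv, w)]

def sam_step(w, curv, lr=0.05, rho=0.05):
    """One sharpness-aware update on the toy loss (hypothetical helper)."""
    g = grad(w, curv)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        # Already at a stationary point; nothing to perturb or update.
        return list(w)
    # Ascent step: move to the approximate worst-case point within radius rho.
    w_adv = [x + rho * gx / norm for x, gx in zip(w, g)]
    # Descent step: apply the gradient taken at the perturbed point
    # to the original weights.
    g_adv = grad(w_adv, curv)
    return [x - lr * gx for x, gx in zip(w, g_adv)]

curv = [10.0, 0.1]   # one sharp and one flat direction
w = [1.0, 1.0]
initial_loss = loss(w, curv)
for _ in range(100):
    w = sam_step(w, curv)
final_loss = loss(w, curv)
```

In a real training setup the two gradient evaluations would be two forward/backward passes over the same mini-batch, which roughly doubles the cost per step compared with a plain gradient update.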

updated: Thu Jun 03 2021 02:08:03 GMT+0000 (UTC)

published: Thu Jun 03 2021 02:08:03 GMT+0000 (UTC)

arXiv
