When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations

Xiangning Chen; Cho-Jui Hsieh; Boqing Gong

Vision Transformers (ViTs) and MLPs continue the effort to replace hand-wired features or inductive biases with general-purpose neural architectures. Existing works empower these models with massive data, such as large-scale pre-training and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rates). Hence, this paper investigates ViTs and MLP-Mixers through the lens of loss geometry, aiming to improve the models' data efficiency at training and generalization at inference. Loss-landscape visualization and Hessian analysis reveal that the converged models sit in extremely sharp local minima. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with the simple Inception-style preprocessing). We show that the improved smoothness is attributable to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pre-training or strong data augmentations. They also possess more perceptive attention maps. Our model checkpoints are released at https://github.com/google-research/vision_transformer.
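The abstract does not spell out how the sharpness-aware optimizer works. As a rough illustration only, the sketch below assumes the commonly described two-step scheme: first perturb the weights along the normalized gradient by a radius `rho`, then apply the gradient measured at that perturbed point to the original weights. The toy quadratic loss, the function names (`sam_step`, `toy_loss`, `toy_grad`), and the values of `lr` and `rho` are all hypothetical and are not taken from the paper or its released code.

```python
import math

def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One sharpness-aware update on a list of floats (illustrative sketch)."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # Zero gradient: no ascent direction is defined, so leave w unchanged.
        return list(w)
    # Ascent step: move to the approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient from the perturbed point to the original w.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

def toy_loss(w):
    # Hypothetical quadratic with one sharp direction (curvature 10) and one flat (curvature 1).
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def toy_grad(w):
    # Analytic gradient of toy_loss.
    return [10.0 * w[0], w[1]]

initial = [1.0, 1.0]
w = list(initial)
for _ in range(50):
    w = sam_step(w, toy_grad)

print("initial loss:", toy_loss(initial))
print("final loss:", toy_loss(w))
```

Each step costs two gradient evaluations instead of one, which is the usual price of this kind of update. In a real training setup the gradients would come from automatic differentiation over mini-batches rather than from a hand-written `toy_grad`.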

updated: Mon Oct 11 2021 23:23:02 GMT+0000 (UTC)

published: Thu Jun 03 2021 02:08:03 GMT+0000 (UTC)

arXiv
