Delving Deep into the Generalization of Vision Transformers under Distribution Shifts

Chongzhi Zhang; Mingyuan Zhang; Shanghang Zhang; Daisheng Jin; Qiang Zhou; Zhongang Cai; Haiyu Zhao; Shuai Yi; Xianglong Liu; Ziwei Liu

Recently, Vision Transformers (ViTs) have achieved impressive results on various vision tasks. Yet their generalization ability under different distribution shifts is poorly understood. In this work, we provide a comprehensive study of the out-of-distribution generalization of ViTs. To support a systematic investigation, we first present a taxonomy of distribution shifts by categorizing them into five conceptual groups: corruption shift, background shift, texture shift, destruction shift, and style shift. We then perform extensive evaluations of ViT variants under different groups of distribution shifts and compare their generalization ability with that of CNNs. Several important observations are obtained: 1) ViTs generalize better than CNNs under multiple distribution shifts. With the same or fewer parameters, ViTs are ahead of corresponding CNNs by more than 5% in top-1 accuracy under most distribution shifts. 2) Larger ViTs gradually narrow the gap between in-distribution and out-of-distribution performance. To further improve the generalization of ViTs, we design Generalization-Enhanced ViTs by integrating adversarial learning, information theory, and self-supervised learning. By investigating three types of generalization-enhanced ViTs, we observe their gradient sensitivity and design a smoother learning strategy to achieve a stable training process. With modified training schemes, we improve performance on out-of-distribution data by 4% over vanilla ViTs. We comprehensively compare the three generalization-enhanced ViTs with their corresponding CNNs and observe that: 1) For the enhanced models, larger ViTs still benefit more in out-of-distribution generalization. 2) Generalization-enhanced ViTs are more sensitive to hyper-parameters than their corresponding CNNs. We hope our comprehensive study can shed light on the design of more generalizable learning architectures.
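The abstract compares models by top-1 accuracy and by the gap between in-distribution and out-of-distribution performance. As a rough illustration of those two quantities only, the sketch below computes them from predicted and true class labels. The function names (`top1_accuracy`, `generalization_gap`) and the toy labels are hypothetical and are not taken from the paper; the gap is assumed here to be a plain difference of accuracies, which may differ from the paper's exact protocol.

```python
from typing import Sequence


def top1_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of samples whose predicted class equals the true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one sample is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def generalization_gap(id_accuracy: float, ood_accuracy: float) -> float:
    """Assumed definition: in-distribution minus out-of-distribution accuracy."""
    return id_accuracy - ood_accuracy


# Toy, made-up data: the same true labels, with predictions on clean
# (in-distribution) inputs and on shifted (out-of-distribution) inputs.
labels = [0, 1, 2, 1, 0, 2, 1, 0]
id_predictions = [0, 1, 2, 1, 0, 2, 1, 1]
ood_predictions = [0, 1, 0, 1, 2, 2, 0, 1]

id_acc = top1_accuracy(id_predictions, labels)
ood_acc = top1_accuracy(ood_predictions, labels)
gap = generalization_gap(id_acc, ood_acc)
print(f"ID top-1: {id_acc:.3f}, OOD top-1: {ood_acc:.3f}, gap: {gap:.3f}")
```

Under this assumed definition, a smaller gap means that accuracy degrades less when the test distribution shifts.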

updated: Fri Jun 18 2021 16:48:04 GMT+0000 (UTC)

published: Mon Jun 14 2021 17:21:41 GMT+0000 (UTC)

arXiv
