Vision Transformers are Robust Learners

Sayak Paul; Pin-Yu Chen

Transformers, composed of multiple self-attention layers, hold strong promise as a generic learning primitive applicable to different data modalities, as shown by recent breakthroughs in computer vision that achieve state-of-the-art (SOTA) standard accuracy with better parameter efficiency. Since self-attention helps a model systematically align different components of the input data, it is worth investigating how such models perform under robustness benchmarks. In this work, we study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We use six diverse ImageNet datasets for robust classification to conduct a comprehensive performance comparison of ViT models and a SOTA convolutional neural network (CNN), Big Transfer (BiT). Through a series of six systematically designed experiments, we then present analyses that provide both quantitative and qualitative indications of why ViTs are indeed more robust learners. For example, with fewer parameters and a similar dataset and pre-training combination, ViT gives a top-1 accuracy of 28.10% on ImageNet-A, which is 4.3x higher than that of a comparable variant of BiT. Our analyses of image masking, Fourier spectrum sensitivity, and the spread of the discrete cosine energy spectrum reveal intriguing properties of ViT that contribute to improved robustness. Code for reproducing our experiments is available here: https://git.io/J3VO0.
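The abstract names Fourier spectrum sensitivity as one of the analyses but does not describe the procedure. As a rough illustration only, the sketch below shows one common way such an analysis is set up: build a unit-norm, real-valued image whose spectrum is concentrated at a single frequency, add it to an input at a fixed strength, and then compare a model's error rate across frequencies. The function names, the `eps` parameter, and the random per-channel sign are assumptions made for this example and are not taken from the authors' code.

```python
import numpy as np


def fourier_basis_image(size, i, j):
    """Return a real, unit-L2-norm (size x size) image concentrated at frequency (i, j).

    Hypothetical helper for illustration; not part of the paper's released code.
    """
    spectrum = np.zeros((size, size), dtype=complex)
    spectrum[i % size, j % size] = 1.0
    # The conjugate-symmetric partner makes the inverse transform real-valued.
    spectrum[-i % size, -j % size] = 1.0
    basis = np.fft.ifft2(spectrum).real
    return basis / np.linalg.norm(basis)


def perturb_with_fourier_noise(image, i, j, eps, rng):
    """Add single-frequency noise of L2 norm `eps` to each channel of a square image.

    `image` is assumed to be an (H, H, C) float array with values in [0, 1].
    """
    basis = fourier_basis_image(image.shape[0], i, j)
    # Random sign per channel (an assumption of this sketch).
    signs = rng.choice([-1.0, 1.0], size=image.shape[-1])
    noisy = image + eps * basis[..., None] * signs
    return np.clip(noisy, 0.0, 1.0)
```

To turn this into a sensitivity map, one would loop over all `(i, j)` pairs, perturb a batch of evaluation images at each frequency, and record the classifier's error rate in a 2D array indexed by frequency. The model and evaluation loop are left out here because the abstract does not specify them.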

updated: Tue May 18 2021 04:02:06 GMT+0000 (UTC)

published: Mon May 17 2021 02:39:22 GMT+0000 (UTC)

arXiv
