Discrete Representations Strengthen Vision Transformer Robustness

Chengzhi Mao; Lu Jiang; Mostafa Dehghani; Carl Vondrick; Rahul Sukthankar; Irfan Essa

Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition. While recent studies suggest that ViTs are more robust than their convolutional counterparts, our experiments find that ViTs trained on ImageNet rely too heavily on local textures and fail to make adequate use of shape information. ViTs thus have difficulty generalizing to out-of-distribution, real-world data. To address this deficiency, we present a simple and effective modification to ViT's input layer: adding discrete tokens produced by a vector-quantized encoder. Unlike the standard continuous pixel tokens, discrete tokens are invariant under small perturbations and individually contain less information, which encourages ViTs to learn global information that is invariant. Experimental results demonstrate that adding discrete representations to four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks while maintaining performance on ImageNet.
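
The input-layer idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how discrete and pixel tokens are combined, so the concatenate-then-project fusion below is an assumption, and the function names (`quantize`, `fuse_tokens`), the random codebook, and all dimensions are hypothetical placeholders. A trained vector-quantized encoder would replace the random `features` and `codebook`.

```python
import numpy as np

def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry."""
    # Squared Euclidean distances, shape (num_patches, codebook_size).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def fuse_tokens(pixel_tokens, indices, discrete_embed, w_fuse):
    """Concatenate pixel tokens with embedded discrete tokens, then project.

    The fusion rule is an assumption made for this sketch.
    """
    discrete_tokens = discrete_embed[indices]            # (num_patches, model_dim)
    fused = np.concatenate([pixel_tokens, discrete_tokens], axis=-1)
    return fused @ w_fuse                                # (num_patches, model_dim)

# Hypothetical sizes and random stand-ins for learned parameters.
rng = np.random.default_rng(0)
num_patches, feat_dim, codebook_size, model_dim = 16, 8, 32, 12
codebook = rng.normal(size=(codebook_size, feat_dim))
features = rng.normal(size=(num_patches, feat_dim))      # stand-in for encoder output
pixel_tokens = rng.normal(size=(num_patches, model_dim))
discrete_embed = rng.normal(size=(codebook_size, model_dim))
w_fuse = rng.normal(size=(2 * model_dim, model_dim))

indices = quantize(features, codebook)
tokens = fuse_tokens(pixel_tokens, indices, discrete_embed, w_fuse)

# A small perturbation changes an index only if the feature crosses
# the boundary between two codebook cells.
perturbed = features + 1e-6 * rng.normal(size=features.shape)
unchanged = (quantize(perturbed, codebook) == indices).all()
```

The nearest-neighbour lookup is what makes the discrete tokens piecewise constant: every feature inside the same codebook cell maps to the same index, so small input changes usually leave the token sequence untouched.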

updated: Sat Apr 02 2022 01:51:00 GMT+0000 (UTC)

published: Sat Nov 20 2021 01:49:56 GMT+0000 (UTC)

arXiv
