Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation

Yao Qin; Chiyuan Zhang; Ting Chen; Balaji Lakshminarayanan; Alex Beutel; Xuezhi Wang

We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure, i.e., they process an image as a sequence of image patches. We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics and makes the image unrecognizable to humans. This indicates that ViTs rely heavily on features that survive such transformations but are generally not indicative of the semantic class to humans. Further investigation shows that these features are useful but non-robust: ViTs trained on them can achieve high in-distribution accuracy but break down under distribution shifts. From this understanding, we ask: can training the model to rely less on these features improve ViT robustness and out-of-distribution performance? We use images transformed with our patch-based operations as negatively augmented views and propose losses that regularize training away from using non-robust features. This view is complementary to existing research, which mostly focuses on augmenting inputs with semantics-preserving transformations to enforce model invariance. We show that patch-based negative augmentation consistently improves the robustness of ViTs across a wide set of ImageNet-based robustness benchmarks. Furthermore, we find that patch-based negative augmentation is complementary to traditional (positive) data augmentation, and that the two together boost performance further. All the code in this work will be open-sourced.
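The abstract does not spell out the patch-based operations or the losses, so the following is only a minimal sketch under stated assumptions. It assumes one possible patch-based transformation (randomly shuffling non-overlapping patches) and one possible regularizer (pushing predictions on the transformed image toward a uniform distribution). The names `patch_shuffle` and `uniform_negative_loss` are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def patch_shuffle(image, patch_size, rng):
    """Randomly permute the non-overlapping patches of an (H, W, C) image.

    Assumption: this is one illustrative patch-based transformation; the
    abstract does not say which operations the authors use.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Split the image into a stack of gh * gw patches, each (p, p, C).
    patches = (
        image.reshape(gh, patch_size, gw, patch_size, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(gh * gw, patch_size, patch_size, c)
    )
    # Reorder the patches; pixels inside each patch are left untouched.
    patches = patches[rng.permutation(gh * gw)]
    # Reassemble the shuffled patches into an image of the original shape.
    return (
        patches.reshape(gh, gw, patch_size, patch_size, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(h, w, c)
    )


def uniform_negative_loss(logits):
    """Cross-entropy between a uniform target and softmax(logits).

    Assumption: a hypothetical regularizer for negatively augmented views.
    It is smallest (log of the class count) when the model is maximally
    uncertain about the transformed image.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Uniform target: average the negative log-probability over classes,
    # then over the batch.
    return float(-log_probs.mean(axis=-1).mean())
```

In a training loop, a term like this could be added, with a tunable weight, to the usual classification loss on the clean images, so that the model is discouraged from making confident predictions from features that survive the shuffle.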

updated: Fri Oct 15 2021 04:53:18 GMT+0000 (UTC)

published: Fri Oct 15 2021 04:53:18 GMT+0000 (UTC)

arXiv
