Are Vision Transformers Robust to Patch Perturbations?

Jindong Gu; Volker Tresp; Yao Qin

Recent advances in Vision Transformers (ViTs) have demonstrated impressive performance in image classification, making them a promising alternative to Convolutional Neural Networks (CNNs). Unlike CNNs, ViTs represent an input image as a sequence of image patches. This patch-based representation raises the following question: how do ViTs perform, compared to CNNs, when individual input image patches are perturbed with natural corruptions or adversarial perturbations? In this work, we study the robustness of ViTs to patch-wise perturbations. Surprisingly, we find that ViTs are more robust than CNNs to naturally corrupted patches, whereas they are more vulnerable to adversarial patches. Furthermore, we discover that the attention mechanism greatly affects the robustness of vision transformers. Specifically, the attention module can help improve the robustness of ViTs by effectively ignoring naturally corrupted patches. However, when ViTs are attacked by an adversary, the attention mechanism can easily be fooled into focusing more on the adversarially perturbed patches, causing a misclassification. Based on our analysis, we propose a simple temperature-scaling-based method to improve the robustness of ViTs against adversarial patches. Extensive qualitative and quantitative experiments across a set of transformer-based architectures support our findings, our understanding, and our improvement of ViT robustness to patch-wise perturbations.
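
The abstract does not spell out how the temperature scaling is applied, so the following is only a minimal sketch of the general idea, assuming the attention logits are divided by a temperature T > 1 before the softmax. The function name `attention_weights`, the toy logits, and the value T = 4 are illustrative assumptions, not the authors' implementation. It shows that a single patch with an inflated attention logit receives less attention mass once the softmax is smoothed.

```python
import numpy as np

def attention_weights(logits, temperature=1.0):
    """Softmax over attention logits, smoothed by a temperature.

    Assumption: the temperature divides the logits before the softmax.
    The source abstract only states that the method is temperature-scaling based.
    """
    scaled = logits / temperature
    scaled = scaled - np.max(scaled)  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / np.sum(exp)

# Toy setup (hypothetical): 16 patches with equal attention logits,
# except patch 3, whose logit is inflated to mimic a patch that
# draws attention toward itself.
num_patches = 16
logits = np.zeros(num_patches)
logits[3] = 6.0

plain = attention_weights(logits, temperature=1.0)
smoothed = attention_weights(logits, temperature=4.0)

print("attention on patch 3, T=1:", plain[3])
print("attention on patch 3, T=4:", smoothed[3])
```

With a higher temperature the attention distribution becomes flatter, so the inflated patch dominates less. Whether this matches the paper's exact formulation, and which temperature works in practice, would need to be checked against the full text.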

updated: Mon Jul 18 2022 17:24:18 GMT+0000 (UTC)

published: Sat Nov 20 2021 19:00:51 GMT+0000 (UTC)

arXiv
