Reveal of Vision Transformers Robustness against Adversarial Attacks

Ahmed Aldahdooh; Wassim Hamidouche; Olivier Deforges

The core component of the vanilla vision transformer (ViT) is the attention block, which gives the model the ability to capture the global context of the input image. To perform well, ViT needs large-scale training data. To overcome this data-hunger limitation, many ViT-based networks, or hybrid-ViTs, have been proposed that include local context during training. The robustness of ViTs and their variants against adversarial attacks has not been investigated in the literature as widely as that of CNNs. This work studies the robustness of ViT variants 1) against different Lp-based adversarial attacks in comparison with CNNs, 2) under adversarial examples (AEs) after applying preprocessing defense methods, and 3) under adaptive attacks using the expectation over transformation (EOT) framework. To that end, we run a set of experiments on 1000 images from ImageNet-1k and provide an analysis which reveals that vanilla ViTs or hybrid-ViTs are more robust than CNNs. For instance, we found that 1) vanilla ViTs or hybrid-ViTs are more robust than CNNs under Lp-based attacks and under adaptive attacks, and 2) unlike hybrid-ViTs, vanilla ViTs do not respond to preprocessing defenses that mainly reduce the high-frequency components. Furthermore, feature maps, attention maps, and Grad-CAM visualizations, together with image quality measures and the perturbations' energy spectrum, are provided to give insight into attention-based models.
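To make the terms in the abstract concrete, here is a minimal sketch of a single-step L-infinity attack (FGSM-style) and an EOT variant that averages the input gradient over a randomized transformation before taking the step. It is an illustration under stated assumptions, not the paper's experimental setup: the toy linear logistic "model", the additive Gaussian noise standing in for a randomized preprocessing defense, and all function and parameter names (`fgsm_linf`, `eot_fgsm_linf`, `sigma`, `n_samples`) are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(w, x, y):
    """Binary cross-entropy of a toy linear model with weights w on input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_gradient(w, x, y):
    """Gradient of the loss with respect to the input x (not the weights)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def linf_step(x, grad, eps):
    """Move each pixel by eps in the gradient-sign direction, clipped to [0, 1]."""
    return [min(1.0, max(0.0, xi + eps * sign(gi))) for xi, gi in zip(x, grad)]

def fgsm_linf(w, x, y, eps):
    """Single-step L-infinity attack using the gradient at the clean input."""
    return linf_step(x, input_gradient(w, x, y), eps)

def eot_fgsm_linf(w, x, y, eps, sigma, n_samples, rng):
    """EOT variant: average the input gradient over sampled transformations.

    Assumption: the transformation is additive Gaussian noise, whose Jacobian
    is the identity, so the gradient through it is the gradient at the noisy input.
    """
    mean_grad = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        g = input_gradient(w, noisy, y)
        mean_grad = [m + gi / n_samples for m, gi in zip(mean_grad, g)]
    return linf_step(x, mean_grad, eps)

# Toy data: 16 "pixels" in [0.2, 0.8] so that clipping does not interfere.
rng = random.Random(0)
w = [rng.gauss(0.0, 1.0) for _ in range(16)]
x = [rng.uniform(0.2, 0.8) for _ in range(16)]
y = 1.0
eps = 0.05

x_adv = fgsm_linf(w, x, y, eps)
x_eot = eot_fgsm_linf(w, x, y, eps, sigma=0.1, n_samples=32, rng=rng)
```

Both adversarial inputs stay within an L-infinity ball of radius `eps` around `x`. For a real image classifier the input gradient would come from automatic differentiation rather than a closed form, and the sampled transformation would be the actual preprocessing defense being attacked.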

updated: Mon Sep 20 2021 11:48:45 GMT+0000 (UTC)

published: Mon Jun 07 2021 15:59:49 GMT+0000 (UTC)

arXiv
