Reveal of Vision Transformers Robustness against Adversarial Attacks

Ahmed Aldahdooh; Wassim Hamidouche; Olivier Deforges

Attention-based networks have achieved state-of-the-art performance in many computer vision tasks, such as image classification. Unlike a Convolutional Neural Network (CNN), the vanilla Vision Transformer (ViT) is built mainly from attention blocks, which give it the ability to capture the global context of the input image. This ability is data-hungry: the larger the training set, the better the performance. To overcome this limitation, many ViT-based networks, or hybrid-ViTs, have been proposed to include local context during training. The robustness of ViTs and their variants against adversarial attacks has not been widely investigated in the literature. A few previous works revealed some robustness attributes, but more insightful ones remain unrevealed. This work studies the robustness of ViT variants 1) against different L_p-based adversarial attacks in comparison with CNNs and 2) under Adversarial Examples (AEs) after applying preprocessing defense methods. To that end, we run a set of experiments on 1000 images from ImageNet-1k and then provide an analysis that reveals that vanilla ViTs or hybrid-ViTs are more robust than CNNs. For instance, we found that 1) vanilla ViTs or hybrid-ViTs are more robust than CNNs under L_0, L_1, L_2, L_∞-based, and Color Channel Perturbations (CCP) attacks. 2) Vanilla ViTs do not respond to preprocessing defenses that mainly reduce high-frequency components, while hybrid-ViTs are more responsive to such defenses. 3) CCP can be used as a preprocessing defense, and larger ViT variants are found to be more responsive than other models. Furthermore, feature maps, attention maps, and Grad-CAM visualizations, together with image quality measures and the perturbations' energy spectra, are provided to give insight into attention-based models.
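To make the L_p terminology in the abstract concrete, the following is a minimal sketch of how the size of an adversarial perturbation can be measured under the L_0, L_1, L_2, and L_∞ norms. It is an illustration only, not the authors' code: the function name `perturbation_norms`, the flat-list image representation, and the sample pixel values are all assumptions made for this example.

```python
import math

def perturbation_norms(clean, adversarial):
    """Return the L_0, L_1, L_2, and L_inf norms of (adversarial - clean).

    Both inputs are flat sequences of pixel values (a hypothetical
    representation chosen for this sketch).
    """
    if len(clean) != len(adversarial):
        raise ValueError("inputs must have the same number of pixels")
    delta = [a - c for c, a in zip(clean, adversarial)]
    return {
        "L0": sum(1 for d in delta if d != 0),      # number of changed pixels
        "L1": sum(abs(d) for d in delta),           # total absolute change
        "L2": math.sqrt(sum(d * d for d in delta)), # Euclidean size of the change
        "Linf": max((abs(d) for d in delta), default=0.0),  # largest single change
    }

# Made-up pixel values in [0, 1]; two of the four pixels are perturbed.
clean = [0.0, 0.5, 1.0, 0.25]
adversarial = [0.0, 0.75, 0.5, 0.25]
norms = perturbation_norms(clean, adversarial)
```

An attack described as L_p-based bounds or minimizes the corresponding quantity: an L_0 attack changes few pixels, while an L_∞ attack caps the largest change to any single pixel.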

updated: Mon Jun 07 2021 15:59:49 GMT+0000 (UTC)

published: Mon Jun 07 2021 15:59:49 GMT+0000 (UTC)

arXiv
