Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs

Philipp Benz; Soomin Ham; Chaoning Zhang; Adil Karjauv; In So Kweon

Convolutional Neural Networks (CNNs) have become the de facto gold standard in computer vision applications in recent years. Recently, however, new model architectures have been proposed that challenge the status quo. The Vision Transformer (ViT) relies solely on attention modules, while the MLP-Mixer architecture substitutes the self-attention modules with Multi-Layer Perceptrons (MLPs). Despite their great success, CNNs are widely known to be vulnerable to adversarial attacks, which raises serious concerns for security-sensitive applications. Thus, it is critical for the community to know whether the newly proposed ViT and MLP-Mixer are also vulnerable to adversarial attacks. To this end, we empirically evaluate their adversarial robustness under several adversarial attack setups and benchmark them against widely used CNNs. Overall, we find that the two architectures, especially ViT, are more robust than CNN models. Using a toy example, we also provide empirical evidence that the lower adversarial robustness of CNNs can be partially attributed to their shift-invariant property. Our frequency analysis suggests that the most robust ViT architectures tend to rely more on low-frequency features than CNNs do. Additionally, we report the intriguing finding that MLP-Mixer is extremely vulnerable to universal adversarial perturbations.
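
For readers unfamiliar with adversarial attacks, the following is a minimal sketch of a single-step gradient-sign attack (FGSM) on a toy logistic-regression classifier. The abstract does not name the attacks used, so FGSM is chosen here only as a common illustration. The model weights, input, and perturbation budget `eps` are hypothetical values, and this is not the authors' evaluation code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss_grad_wrt_input(w, b, x, y):
    """Gradient of the binary cross-entropy loss w.r.t. the input: (p - y) * w."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One-step L-infinity attack: move each input coordinate by eps
    in the direction that increases the loss."""
    grad = loss_grad_wrt_input(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

# Hypothetical toy model and input (assumptions for illustration only).
w = [2.0, -1.0, 0.5]
b = 0.0
x = [0.3, -0.2, 0.1]
y = 1  # true label

x_adv = fgsm(w, b, x, y, eps=0.3)
clean_p = predict_proba(w, b, x)      # above 0.5: classified as class 1
adv_p = predict_proba(w, b, x_adv)    # below 0.5: prediction flips to class 0

print(f"clean p(class 1) = {clean_p:.3f}")
print(f"adversarial p(class 1) = {adv_p:.3f}")
```

In this sketch a perturbation bounded by 0.3 per coordinate is enough to flip the prediction. A robustness benchmark of the kind the abstract describes would measure how often such bounded perturbations succeed across a dataset, for each architecture.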

updated: Mon Oct 11 2021 14:28:50 GMT+0000 (UTC)

published: Wed Oct 06 2021 14:18:47 GMT+0000 (UTC)

arXiv
