Self-Supervised Adversarial Example Detection by Disentangled Representation

Zhaoxi Zhang; Leo Yu Zhang; Xufei Zheng; Shengshan Hu; Jinyu Tian; Jiantao Zhou

Deep learning models are known to be vulnerable to adversarial examples: inputs that are elaborately crafted for malicious purposes yet imperceptible to the human perceptual system. Autoencoders trained solely on benign examples have been widely used for (self-supervised) adversarial detection, based on the assumption that adversarial examples yield larger reconstruction errors. However, because no adversarial examples are seen during training and the autoencoder generalizes too strongly, this assumption does not always hold in practice. To alleviate this problem, we explore detecting adversarial examples through disentangled representations of images under the autoencoder structure. By disentangling input images into class features and semantic features, we train an autoencoder, assisted by a discriminator network, over both correctly paired and incorrectly paired class/semantic features to reconstruct benign examples and counterexamples. This mimics the behavior of adversarial examples and can reduce the unnecessary generalization ability of the autoencoder. We compare our method with state-of-the-art self-supervised detection methods across different datasets (MNIST, Fashion-MNIST, and CIFAR-10), different adversarial attack methods (FGSM, BIM, PGD, DeepFool, and CW), and different victim models (an 8-layer CNN and a 16-layer VGG), for 30 attack settings in total. Our method exhibits better performance on various measurements (AUC, FPR, TPR) in most attack settings. The ideal AUC is 1, and our method achieves 0.99+ on CIFAR-10 for all attacks. Notably, unlike other autoencoder-based detectors, our method can provide resistance to adaptive adversaries.
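The detection assumption and the reported measurements (AUC, FPR, TPR) can be illustrated with a small sketch. This is not the paper's disentangled autoencoder; it only shows the generic baseline idea of scoring an input by its reconstruction error and flagging it when the score exceeds a threshold. The function names (`mse`, `auc`, `rates_at_threshold`), the toy error values, and the threshold of 0.22 are all hypothetical and chosen purely for illustration.

```python
from typing import Sequence, Tuple


def mse(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Mean squared reconstruction error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def auc(benign: Sequence[float], adversarial: Sequence[float]) -> float:
    """Probability that a random adversarial score exceeds a random benign
    score, counting ties as one half (the rank-based definition of AUC)."""
    wins = 0.0
    for a in adversarial:
        for b in benign:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(adversarial) * len(benign))


def rates_at_threshold(
    benign: Sequence[float], adversarial: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Flag an input as adversarial when its score exceeds the threshold.
    Returns (false positive rate, true positive rate)."""
    fpr = sum(s > threshold for s in benign) / len(benign)
    tpr = sum(s > threshold for s in adversarial) / len(adversarial)
    return fpr, tpr


# Hypothetical reconstruction errors, not taken from the paper.
benign_errors = [0.10, 0.12, 0.15, 0.20, 0.30]
adversarial_errors = [0.25, 0.40, 0.45, 0.50, 0.60]

score = auc(benign_errors, adversarial_errors)
fpr, tpr = rates_at_threshold(benign_errors, adversarial_errors, threshold=0.22)
print(f"AUC={score:.2f} FPR={fpr:.2f} TPR={tpr:.2f}")
```

In this toy data one benign input has a larger error than one adversarial input, so the AUC falls below the ideal value of 1. If the autoencoder reconstructs adversarial inputs as well as benign ones, the two score distributions overlap more and the AUC drops further, which is the failure mode the abstract attributes to overly strong generalization.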

updated: Wed May 12 2021 12:37:42 GMT+0000 (UTC)

published: Sat May 08 2021 12:48:18 GMT+0000 (UTC)

arXiv
