Detecting Backdoor in Deep Neural Networks via Intentional Adversarial Perturbations

Mingfu Xue; Yinghao Wu; Zhiyu Wu; Jian Wang; Yushu Zhang; Weiqiang Liu

Recent research shows that deep learning models are susceptible to backdoor attacks, in which a backdoor embedded in the model is triggered when a backdoor instance arrives. This paper proposes a novel backdoor detection method based on adversarial examples. The method uses intentional adversarial perturbations to detect whether an image contains a trigger, and it can be applied in two scenarios: sanitizing the training set during the training stage and detecting backdoor instances during the inference stage. Specifically, given an untrusted image, an adversarial perturbation is intentionally added to it; if the model's prediction on the perturbed image is consistent with its prediction on the unperturbed image, the input is considered a backdoor instance. The method requires low computational resources and, because the added perturbation is very small, maintains the visual quality of the images. Experimental results show that the proposed defense reduces the backdoor attack success rates from 99.47%, 99.77% and 97.89% to 0.37%, 0.24% and 0.09% on the Fashion-MNIST, CIFAR-10 and GTSRB datasets, respectively. In addition, for attacks under different settings (trigger transparency, trigger size and trigger pattern), the false acceptance rates of the proposed method are as low as 1.2%, 0.3% and 0.04% on Fashion-MNIST, CIFAR-10 and GTSRB, respectively, which demonstrates that the method achieves high defense performance against backdoor attacks under different attack settings.
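The detection rule in the abstract (flag an input when a small adversarial perturbation fails to change the model's prediction) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy linear classifier `W`, the three-"pixel" inputs, the step size `eps`, and the FGSM-style perturbation, which is swapped in because the abstract does not say how the perturbation is generated. It is not the authors' implementation.

```python
# Hypothetical toy classifier: logits = W @ x over three "pixels" in [0, 1].
# Pixel 2 plays the role of a trigger that strongly pushes toward class 1.
W = [[1.0, -1.0, 0.0],
     [-1.0, 1.0, 20.0]]


def logits(x):
    """Linear scores, one per class."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def predict(x):
    """Index of the highest-scoring class."""
    scores = logits(x)
    return max(range(len(scores)), key=scores.__getitem__)


def perturb(x, eps=0.1):
    """One FGSM-style step (an assumed stand-in) that shrinks the margin
    between the predicted class and the runner-up class."""
    scores = logits(x)
    ranked = sorted(range(len(scores)), key=scores.__getitem__)
    top, runner_up = ranked[-1], ranked[-2]
    # For a linear model, the gradient of the margin w.r.t. x is W[top] - W[runner_up].
    grad = [a - b for a, b in zip(W[top], W[runner_up])]
    sign = lambda g: (g > 0) - (g < 0)
    # Step against the margin and keep pixel values inside [0, 1].
    return [min(1.0, max(0.0, v - eps * sign(g))) for v, g in zip(x, grad)]


def is_backdoor_instance(x, eps=0.1):
    """Flag x when the prediction survives the intentional perturbation."""
    return predict(x) == predict(perturb(x, eps))


clean = [0.55, 0.45, 0.0]      # no trigger: small margin, the prediction flips
triggered = [0.55, 0.45, 1.0]  # trigger set: the prediction stays at class 1
```

In this toy setting the clean input changes class under the perturbation while the triggered input does not, so only the latter is flagged. The sketch also shows the sensitivity to `eps`: a clean input with a large enough margin would survive the same step and be wrongly flagged, so the perturbation strength has to be chosen with the model in mind.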

updated: Sat May 29 2021 09:33:05 GMT+0000 (UTC)

published: Sat May 29 2021 09:33:05 GMT+0000 (UTC)

arXiv
