Adversarial Fine-tuning for Backdoor Defense: Connect Adversarial Examples to Triggered Samples

Bingxu Mu; Le Wang; Zhenxing Niu

Deep neural networks (DNNs) are known to be vulnerable to backdoor attacks: once a backdoor trigger is planted at training time, the infected DNN model misclassifies any test sample embedded with the trigger as the target label. Because backdoor attacks are stealthy, it is hard to either detect the backdoor or erase it from an infected model. In this paper, we propose a new Adversarial Fine-Tuning (AFT) approach that erases backdoor triggers by leveraging adversarial examples of the infected model. We observe that the adversarial examples of an infected model behave similarly to its triggered samples. Based on this observation, we design AFT to break the foundation of the backdoor attack, i.e., the strong correlation between a trigger and a target label. We show empirically that, against 5 state-of-the-art backdoor attacks, AFT effectively erases the backdoor triggers without obvious performance degradation on clean samples, and that it significantly outperforms existing defense methods.
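
The abstract does not say how the adversarial examples are generated or which loss AFT optimizes, so the following is only a minimal sketch of the general idea of fine-tuning a model on its own adversarial examples paired with their correct labels. It is not the authors' algorithm. Everything in it is an assumption made for illustration: the toy logistic-regression model, the use of FGSM as the attack, the data, and all function names and hyperparameters (`fgsm`, `adversarial_fine_tune`, `eps`, `lr`, `epochs`).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a toy logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, y, eps):
    """FGSM (assumed attack): shift x by eps along the sign of dLoss/dx."""
    p = predict(w, b, x)
    # For cross-entropy loss on logistic regression, dL/dx_i = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

def adversarial_loss(w, b, data, eps):
    """Mean cross-entropy on FGSM examples crafted against (w, b)."""
    total = 0.0
    for x, y in data:
        p = predict(w, b, fgsm(w, b, x, y, eps))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(data)

def adversarial_fine_tune(w, b, data, eps=0.5, lr=0.1, epochs=50):
    """Fine-tune on adversarial examples labelled with their correct class."""
    w = list(w)  # do not mutate the caller's weights
    for _ in range(epochs):
        for x, y in data:
            # Craft the adversarial example against the current model.
            x_adv = fgsm(w, b, x, y, eps)
            p = predict(w, b, x_adv)
            # One SGD step on the cross-entropy loss at (x_adv, y).
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x_adv)]
            b -= lr * (p - y)
    return w, b

# Hypothetical clean data: two well-separated classes in 2-D.
data = [([2.0, 2.0], 1), ([2.0, 1.5], 1), ([1.5, 2.0], 1),
        ([-2.0, -2.0], 0), ([-2.0, -1.5], 0), ([-1.5, -2.0], 0)]

w0, b0 = [0.2, 0.2], 0.0  # stand-in for the model to be fine-tuned
w1, b1 = adversarial_fine_tune(w0, b0, data)

loss_before = adversarial_loss(w0, b0, data, eps=0.5)
loss_after = adversarial_loss(w1, b1, data, eps=0.5)
```

In this toy setting, `loss_after` can be compared with `loss_before` to check that fine-tuning made the model harder to fool with the same attack. Whether that also removes a backdoor is the empirical claim of the paper, and this sketch does not demonstrate it.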

updated: Sun Feb 13 2022 13:41:15 GMT+0000 (UTC)

published: Sun Feb 13 2022 13:41:15 GMT+0000 (UTC)

arXiv
