Confidence-Calibrated Adversarial Training: Generalizing to Unseen Attacks

David Stutz; Matthias Hein; Bernt Schiele

Adversarial training yields models that are robust against a specific threat model, e.g., L_∞ adversarial examples. Typically, robustness does not generalize to previously unseen threat models, e.g., other L_p norms or larger perturbations. Our confidence-calibrated adversarial training (CCAT) tackles this problem by biasing the model towards low-confidence predictions on adversarial examples. By allowing low-confidence examples to be rejected, robustness generalizes beyond the threat model employed during training. CCAT, trained only on L_∞ adversarial examples, increases robustness against larger L_∞, L_2, L_1, and L_0 attacks, adversarial frames, distal adversarial examples, and corrupted examples, and yields better clean accuracy than adversarial training. For a thorough evaluation, we developed novel white- and black-box attacks that attack CCAT directly by maximizing confidence. For each threat model, we use 7 attacks with up to 50 restarts and 5000 iterations and report the worst-case robust test error, extended to our confidence-thresholded setting, across all attacks.
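To make the rejection step concrete, the following is a minimal sketch of confidence-thresholded prediction, assuming the model outputs class probabilities and that confidence means the largest of them. The function names, the `threshold` parameter, and the toy probabilities are illustrative assumptions, and the error computed here (misclassified fraction among accepted examples) is a simplification, not necessarily the exact confidence-thresholded robust test error defined in the paper.

```python
from typing import List, Sequence, Tuple


def predict_with_rejection(
    probs: Sequence[Sequence[float]], threshold: float
) -> List[Tuple[int, bool]]:
    """Return (predicted class, accepted) for each example.

    An example is accepted only if its confidence, taken here as the
    largest class probability, is at least `threshold` (hypothetical name).
    """
    results = []
    for p in probs:
        confidence = max(p)
        label = list(p).index(confidence)
        results.append((label, confidence >= threshold))
    return results


def error_on_accepted(
    probs: Sequence[Sequence[float]], labels: Sequence[int], threshold: float
) -> float:
    """Fraction of accepted examples that are misclassified.

    Returns 0.0 when every example is rejected. This is a simplified
    stand-in for a confidence-thresholded error, not the paper's definition.
    """
    decisions = predict_with_rejection(probs, threshold)
    accepted = [(pred, y) for (pred, ok), y in zip(decisions, labels) if ok]
    if not accepted:
        return 0.0
    wrong = sum(1 for pred, y in accepted if pred != y)
    return wrong / len(accepted)


# Toy class probabilities (made-up numbers, not results from the paper).
probs = [
    [0.97, 0.02, 0.01],  # confident and correct
    [0.40, 0.35, 0.25],  # low confidence, as hoped for on an adversarial input
    [0.05, 0.90, 0.05],  # confident but wrong
    [0.10, 0.10, 0.80],  # confident and correct
]
labels = [0, 1, 0, 2]

if __name__ == "__main__":
    for tau in (0.0, 0.5, 0.95):
        print(tau, error_on_accepted(probs, labels, tau))
```

Raising the threshold rejects more inputs. A model that assigns low confidence to adversarial examples can therefore reject them regardless of which threat model produced them, at the cost of also rejecting some clean inputs.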

updated: Tue Jun 30 2020 12:03:44 GMT+0000 (UTC)

published: Mon Oct 14 2019 16:38:03 GMT+0000 (UTC)

arXiv
