Adversarial training yields models that are robust against a specific threat model, e.g., L_∞ adversarial examples. Typically, this robustness does not generalize to previously unseen threat models, e.g., other L_p norms or larger perturbations. Our confidence-calibrated adversarial training (CCAT) tackles this problem by biasing the model towards low-confidence predictions on adversarial examples. By rejecting examples with low confidence, robustness generalizes beyond the threat model employed during training. CCAT, trained only on L_∞ adversarial examples, increases robustness against larger L_∞, L_2, L_1 and L_0 attacks, adversarial frames, distal adversarial examples and corrupted examples, and yields better clean accuracy than adversarial training. For a thorough evaluation, we developed novel white- and black-box attacks that directly attack CCAT by maximizing confidence. For each threat model, we use 7 attacks with up to 50 restarts and 5000 iterations and report the worst-case robust test error across all attacks, extended to our confidence-thresholded setting.
updated: Tue Jun 30 2020 12:03:44 GMT+0000 (UTC)
published: Mon Oct 14 2019 16:38:03 GMT+0000 (UTC)
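
The confidence-thresholded rejection and worst-case evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fixed threshold, and the choice of denominator (examples accepted under at least one attack) are assumptions made for illustration, and the paper's exact metric definition may differ.

```python
def predict_or_reject(probs, threshold):
    """Return the predicted class index, or None if the example is rejected.

    Assumption: confidence is the maximum predicted probability, and an
    example is rejected when that confidence is below the threshold.
    """
    confidence = max(probs)
    if confidence < threshold:
        return None
    return probs.index(confidence)


def worst_case_thresholded_error(per_attack_probs, labels, threshold):
    """Simplified worst-case error over several attacks with rejection.

    per_attack_probs[a][n] holds the class probabilities for example n
    under attack a (hypothetical layout). An example counts as an error if
    any attack yields an accepted (confident) but wrong prediction.
    Examples rejected under every attack are left out of the denominator.
    """
    errors = 0
    accepted = 0
    for n, label in enumerate(labels):
        any_accepted = False
        any_wrong = False
        for attack_probs in per_attack_probs:
            pred = predict_or_reject(attack_probs[n], threshold)
            if pred is None:
                continue  # rejected under this attack
            any_accepted = True
            if pred != label:
                any_wrong = True
        if any_accepted:
            accepted += 1
            if any_wrong:
                errors += 1
    return errors / accepted if accepted else 0.0


# Toy data: two hypothetical attacks, three examples, two classes.
per_attack_probs = [
    [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]],
    [[0.3, 0.7], [0.5, 0.5], [0.4, 0.6]],
]
labels = [0, 0, 1]
error = worst_case_thresholded_error(per_attack_probs, labels, threshold=0.6)
```

In this toy setup the second example is rejected under both attacks and so does not count against the model, which is the mechanism that lets low-confidence predictions on unseen perturbations translate into robustness.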