Uncertify: Attacks Against Neural Network Certification

Tobias Lorenz; Marta Kwiatkowska; Mario Fritz

A key concept for reliable, robust, and safe AI systems is to implement fallback strategies for cases where the AI's predictions cannot be trusted. Certifiers for neural networks have made great progress towards provable robustness guarantees against evasion attacks that use adversarial examples. For some predictions, these methods guarantee that a certain class of manipulations or attacks could not have changed the outcome. For the remaining predictions, which carry no guarantee, the method abstains and a fallback strategy must be invoked, which is typically more costly, less accurate, or even involves a human operator. Although fallback is central to safe and secure AI, we show for the first time that it comes with its own security risks, because an adversary can deliberately trigger such fallback strategies. In particular, we conduct the first systematic analysis of training-time attacks against certifiers in practical application pipelines, identifying new threat vectors that can be exploited to degrade the overall system. Using these insights, we design two backdoor attacks against network certifiers that can drastically reduce certified robustness. For example, adding 1% poisoned data during training is sufficient to reduce certified robustness by up to 95 percentage points, effectively rendering the certifier useless. We analyze how such novel attacks can compromise the overall system's integrity or availability. Our extensive experiments across multiple datasets, model architectures, and certifiers demonstrate the wide applicability of these attacks. A first investigation into potential defenses shows that current approaches are insufficient to mitigate the issue, highlighting the need for new, more specific solutions.
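
The certify-or-abstain pipeline described in the abstract can be illustrated with a minimal sketch. It is not the paper's certifier or attack: it assumes a toy linear classifier, for which robustness within an L-infinity ball of radius `eps` can be checked exactly, and all names (`certified_class`, `predict_with_fallback`, `fallback`) are hypothetical. It only shows where an abstention hands control to the fallback, which is the step an adversary would aim to trigger.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def certified_class(weights: List[Vector], biases: Vector,
                    x: Vector, eps: float) -> Optional[int]:
    """Return the predicted class if no perturbation with L-infinity norm
    at most eps can change it; return None (abstain) otherwise.

    Assumption: the model is linear (logit_k = w_k . x + b_k), so the worst
    case margin against class j is margin - eps * ||w_pred - w_j||_1.
    """
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    pred = max(range(len(logits)), key=logits.__getitem__)
    for j in range(len(logits)):
        if j == pred:
            continue
        l1_norm = sum(abs(wp - wj)
                      for wp, wj in zip(weights[pred], weights[j]))
        worst_margin = (logits[pred] - logits[j]) - eps * l1_norm
        if worst_margin <= 0:
            return None  # no guarantee: the certifier abstains
    return pred


def predict_with_fallback(weights: List[Vector], biases: Vector,
                          x: Vector, eps: float,
                          fallback: Callable[[Vector], int]
                          ) -> Tuple[int, bool]:
    """Return (label, certified). Uncertified inputs go to the costly
    fallback (hypothetical stand-in for e.g. a human operator)."""
    label = certified_class(weights, biases, x, eps)
    if label is not None:
        return label, True
    return fallback(x), False


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    B = [0.0, 0.0]
    # Far from the decision boundary: certified.
    print(predict_with_fallback(W, B, [2.0, 0.5], 0.5, lambda x: -1))
    # Close to the boundary: abstains and invokes the fallback.
    print(predict_with_fallback(W, B, [1.0, 0.8], 0.5, lambda x: -1))
```

In this sketch, every input that fails certification costs one fallback call, so anything that lowers the certified fraction (as the backdoor attacks in the abstract do at training time) directly raises the load on the fallback.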

updated: Fri May 13 2022 12:11:56 GMT+0000 (UTC)

published: Wed Aug 25 2021 15:49:10 GMT+0000 (UTC)

arXiv
