Adversarial alignment: Breaking the trade-off between the strength of an attack and its relevance to human perception

Drew Linsley; Pinyuan Feng; Thibaut Boissin; Alekh Karkada Ashok; Thomas Fel; Stephanie Olaiya; Thomas Serre

Deep neural networks (DNNs) are known to have a fundamental sensitivity to adversarial attacks: perturbations of the input that are imperceptible to humans yet powerful enough to change a model's visual decision. Adversarial attacks have long been considered the "Achilles' heel" of deep learning, one that may eventually force a shift in modeling paradigms. Nevertheless, the formidable capabilities of modern large-scale DNNs have somewhat eclipsed these early concerns. Do adversarial attacks continue to pose a threat to DNNs? Here, we investigate how the robustness of DNNs to adversarial attacks has evolved as their accuracy on ImageNet has continued to improve. We measure adversarial robustness in two different ways. First, we measure the smallest adversarial attack needed to cause a model to change its object categorization decision. Second, we measure how well successful attacks align with the features that humans find diagnostic for object recognition. We find that as DNNs become more accurate on ImageNet, adversarial attacks induce bigger and more easily detectable changes to image pixels, but these attacks are also becoming less aligned with the features that humans find diagnostic for recognition. To better understand the source of this trade-off, we turn to the neural harmonizer, a DNN training routine that encourages models to leverage the same features as humans to solve tasks. Harmonized DNNs achieve the best of both worlds: attacks on them are detectable and affect features that humans find diagnostic for recognition, meaning that such attacks are more likely to be rendered ineffective because they induce similar effects on human perception. Our findings suggest that the sensitivity of DNNs to adversarial attacks can be mitigated by DNN scale, data scale, and training routines that align models with biological intelligence.
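The two robustness measures described in the abstract can be illustrated with a toy sketch. The abstract does not state the exact metrics, so everything here is an assumption made for illustration: attack size is taken to be the L2 norm of the perturbation, alignment is taken to be a Spearman rank correlation, a binary linear classifier stands in for a DNN, and `human_map` is synthetic data rather than a real human feature-importance map. The function names are hypothetical.

```python
import numpy as np

def minimal_l2_perturbation(w, b, x, overshoot=1e-3):
    """Smallest L2 perturbation (plus a tiny overshoot) that flips the
    decision of the binary linear classifier sign(w.x + b).
    The linear model is a stand-in for a DNN (assumption)."""
    margin = w @ x + b
    return -(1.0 + overshoot) * (margin / (w @ w)) * w

def ranks(v):
    """Ranks 0..n-1 of a 1-D array; assumes no ties for simplicity."""
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(len(v))
    return r

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return np.corrcoef(ranks(a.ravel()), ranks(b.ravel()))[0, 1]

rng = np.random.default_rng(0)
x = rng.normal(size=16)   # toy "image" with 16 pixels
w = rng.normal(size=16)   # toy classifier weights
b = 0.1

delta = minimal_l2_perturbation(w, b, x)

# Measure 1 (assumed form): size of the smallest successful attack.
attack_size = np.linalg.norm(delta)

# Measure 2 (assumed form): rank correlation between where the attack
# changes pixels and where a (synthetic, hypothetical) human map puts weight.
human_map = np.abs(w) + rng.normal(scale=0.1, size=16)
alignment = spearman(np.abs(delta), human_map)

print(f"attack size (L2): {attack_size:.4f}")
print(f"alignment (Spearman): {alignment:.4f}")
```

In this reading, a larger `attack_size` means the attack is easier to detect, and a higher `alignment` means the attack targets features that the human map marks as important. For a real DNN the minimal perturbation has no closed form and would need an iterative attack.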

updated: Mon Jun 05 2023 20:26:17 GMT+0000 (UTC)

published: Mon Jun 05 2023 20:26:17 GMT+0000 (UTC)

arXiv
