Brittle interpretations: The Vulnerability of TCAV and Other Concept-based Explainability Tools to Adversarial Attack

Davis Brown; Henry Kvinge

Methods for model explainability have become increasingly critical for testing the fairness and soundness of deep learning. A number of explainability techniques have been developed that use a set of examples to represent a human-interpretable concept in a model's activations. In this work we show that these explainability methods can suffer from the same vulnerability to adversarial attacks as the models they are meant to analyze. We demonstrate this phenomenon on two well-known concept-based approaches to the explainability of deep learning models: TCAV and faceted feature visualization. We show that by carefully perturbing the examples of the concept under investigation, we can radically change the output of the interpretability method, e.g., so that it reports that stripes are not an important factor in identifying images of a zebra. Our work highlights that in safety-critical applications, there is a need for security around not only the machine learning pipeline but also the model interpretation process.
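
To make the mechanism concrete, here is a minimal sketch of how a concept direction built from example activations feeds a TCAV-style score, and why changing the concept examples changes the result. It is an illustration under stated assumptions, not the authors' implementation: the concept direction is approximated by a difference of means rather than the normal of a trained linear classifier, the activations and gradients are made-up toy numbers, the function names are hypothetical, and the "attack" shifts activations directly instead of perturbing input images.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_activation_vector(concept_acts, random_acts):
    """Unit vector pointing from the random-example mean to the concept mean.

    Assumption: difference of means stands in for the normal of a linear
    classifier that separates concept activations from random activations.
    """
    diff = [c - r for c, r in zip(mean_vector(concept_acts), mean_vector(random_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("concept and random activations have the same mean")
    return [d / norm for d in diff]


def tcav_score(class_gradients, cav):
    """Fraction of class inputs whose logit gradient has a positive
    directional derivative along the concept direction."""
    positive = sum(
        1 for g in class_gradients if sum(gi * vi for gi, vi in zip(g, cav)) > 0
    )
    return positive / len(class_gradients)


# Toy 2-D layer activations (hypothetical values).
concept_acts = [[2.0, 0.1], [1.8, -0.1], [2.2, 0.0]]   # e.g. "striped" examples
random_acts = [[0.0, 0.1], [0.1, -0.1], [-0.1, 0.0]]   # random counterexamples

# Gradients of the class logit w.r.t. the layer, one per class image (hypothetical).
class_gradients = [[1.0, 0.2], [0.8, -0.3], [0.5, 0.1], [-0.2, 0.4]]

clean_cav = concept_activation_vector(concept_acts, random_acts)
clean_score = tcav_score(class_gradients, clean_cav)

# Stand-in for an attack: the perturbed concept examples land elsewhere in
# activation space, which flips the learned concept direction.
perturbed_acts = [[a[0] - 4.0, a[1]] for a in concept_acts]
attacked_cav = concept_activation_vector(perturbed_acts, random_acts)
attacked_score = tcav_score(class_gradients, attacked_cav)

print(clean_score, attacked_score)
```

The model and the class images are untouched between the two runs; only the examples that define the concept change, and the reported concept importance changes with them.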

updated: Thu Oct 14 2021 02:12:33 GMT+0000 (UTC)

published: Thu Oct 14 2021 02:12:33 GMT+0000 (UTC)

arXiv
