Do Input Gradients Highlight Discriminative Features?

Harshay Shah; Prateek Jain; Praneeth Netrapalli

Post-hoc gradient-based interpretability methods [Simonyan et al., 2013, Smilkov et al., 2017] that provide instance-specific explanations of model predictions are often based on assumption (A): the magnitude of input gradients (gradients of logits with respect to the input) noisily highlights discriminative task-relevant features. In this work, we test the validity of assumption (A) using a three-pronged approach. First, we develop an evaluation framework, DiffROAR, to test assumption (A) on four image classification benchmarks. Our results suggest that (i) input gradients of standard models (i.e., models trained on the original data) may grossly violate (A), whereas (ii) input gradients of adversarially robust models satisfy (A). Second, we introduce BlockMNIST, an MNIST-based semi-real dataset that by design encodes a priori knowledge of discriminative features. Our analysis on BlockMNIST leverages this information to validate and characterize differences between the input gradient attributions of standard and robust models. Finally, we theoretically prove that our empirical findings hold on a simplified version of the BlockMNIST dataset. Specifically, we prove that input gradients of standard one-hidden-layer MLPs trained on this dataset do not highlight instance-specific signal coordinates, thus grossly violating assumption (A). Our findings motivate the need to formalize and test common assumptions in interpretability in a falsifiable manner [Leavitt and Morcos, 2020]. We believe that the DiffROAR evaluation framework and BlockMNIST-based datasets can serve as sanity checks to audit instance-specific interpretability methods. Code and data are available at https://github.com/harshays/inputgradients.
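To make the quantity in assumption (A) concrete, the following is a minimal sketch of an input gradient for a one-hidden-layer ReLU MLP with a single logit, together with a ranking of input coordinates by gradient magnitude. It is not taken from the authors' repository: the function names, the toy weights, and the input are all hypothetical and chosen only for illustration, and the gradient is written out by hand rather than computed with an autodiff library.

```python
def relu_mlp_logit(x, W1, b1, w2, b2):
    """Logit of a one-hidden-layer ReLU MLP (hypothetical toy model)."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return sum(v * h for v, h in zip(w2, hidden)) + b2


def input_gradient(x, W1, b1, w2):
    """Gradient of the logit with respect to the input x.

    For a ReLU MLP: d logit / d x_j = sum_i w2[i] * 1[pre_i > 0] * W1[i][j],
    where pre_i is the pre-activation of hidden unit i.
    """
    grad = [0.0] * len(x)
    for row, b, v in zip(W1, b1, w2):
        pre = sum(w * xi for w, xi in zip(row, x)) + b
        if pre > 0:  # only active hidden units contribute
            for j, w in enumerate(row):
                grad[j] += v * w
    return grad


def rank_by_magnitude(grad):
    """Input coordinates sorted from largest to smallest |gradient|."""
    return sorted(range(len(grad)), key=lambda j: abs(grad[j]), reverse=True)


# Toy weights and input, assumed purely for illustration.
W1 = [[1.0, -2.0, 0.0], [0.5, 0.0, 3.0]]
b1 = [0.0, -1.0]
w2 = [2.0, -1.0]
b2 = 0.0
x = [1.0, 0.2, 0.5]

grad = input_gradient(x, W1, b1, w2)
ranking = rank_by_magnitude(grad)
print(grad)     # [1.5, -4.0, -3.0]
print(ranking)  # [1, 2, 0]
```

Under assumption (A), the coordinates at the front of `ranking` would be the discriminative ones for this input. The abstract's claim is that this reading can fail for standard models, so a ranking like this one should be treated as a hypothesis to test rather than as an explanation.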

updated: Tue Oct 26 2021 14:28:05 GMT+0000 (UTC)

published: Thu Feb 25 2021 11:04:38 GMT+0000 (UTC)

arXiv
