Crowdsourcing Evaluation of Saliency-based XAI Methods

Xiaotian Lu; Arseny Tolmachev; Tatsuya Yamamoto; Koh Takeuchi; Seiji Okajima; Tomoyoshi Takebayashi; Koji Maruhashi; Hisashi Kashima

Understanding the reasons behind the predictions made by deep neural networks is critical for gaining human trust in many important applications, which is reflected in the increasing demand for explainable AI (XAI) in recent years. Saliency-based feature attribution methods, which highlight the parts of an image that contribute most to a classifier's decision, are often used as XAI methods, especially in computer vision. To compare saliency-based XAI methods quantitatively, several automated evaluation schemes have been proposed; however, there is no guarantee that such automated metrics correctly evaluate explainability, and a high rating from an automated scheme does not necessarily mean high explainability for humans. In this study, instead of automated evaluation, we propose a new human-based evaluation scheme that uses crowdsourcing to evaluate XAI methods. Our method is inspired by the human computation game "Peek-a-boom" and can efficiently compare different XAI methods by exploiting the power of crowds. We evaluate the saliency maps of various XAI methods on two datasets using both automated and crowd-based evaluation schemes. Our experiments show that the results of our crowd-based evaluation scheme differ from those of the automated evaluation schemes. In addition, we regard the crowd-based evaluation results as ground truth and provide a quantitative performance measure for comparing different automated evaluation schemes. We also discuss the impact of crowd workers on the results and show that variation in workers' ability does not significantly affect the results.
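The abstract does not say which performance measure is used to compare automated schemes against the crowd-based ground truth. One plausible way to express the idea is a rank-agreement score: check how often an automated metric orders each pair of XAI methods the same way the crowd does. The sketch below uses Kendall's tau-a for this; the choice of tau-a, the function name `kendall_tau`, the method names, and all scores are hypothetical illustrations, not the paper's measure or data. It assumes both scorings are "higher is better"; a lower-is-better metric should be negated first.

```python
from itertools import combinations


def kendall_tau(crowd_scores: dict[str, float], auto_scores: dict[str, float]) -> float:
    """Kendall's tau-a between a crowd-based and an automated scoring of XAI methods.

    Hypothetical agreement measure (an assumption, not the paper's definition):
    +1 means the automated metric orders every pair of methods exactly as the
    crowd does, -1 means it reverses every pair. Tied pairs count as neither.
    Both dicts are assumed to be "higher is better".
    """
    methods = sorted(crowd_scores)
    if set(methods) != set(auto_scores):
        raise ValueError("both scorings must cover the same methods")
    n = len(methods)
    if n < 2:
        raise ValueError("need at least two methods to compare")

    concordant = discordant = 0
    for a, b in combinations(methods, 2):
        # Same sign of the score difference in both scorings -> pair agrees.
        sign = (crowd_scores[a] - crowd_scores[b]) * (auto_scores[a] - auto_scores[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical scores for three placeholder methods (not real results).
crowd = {"method_a": 0.9, "method_b": 0.7, "method_c": 0.4}
auto_agrees = {"method_a": 0.8, "method_b": 0.6, "method_c": 0.1}
auto_reversed = {"method_a": 0.2, "method_b": 0.5, "method_c": 0.9}

print(kendall_tau(crowd, auto_agrees))    # 1.0
print(kendall_tau(crowd, auto_reversed))  # -1.0
```

Under this sketch, an automated scheme whose tau against the crowd ranking is closer to +1 would be judged as better reflecting human-perceived explainability; other rank correlations such as Spearman's rho would serve the same purpose.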

updated: Mon Aug 30 2021 14:10:05 GMT+0000 (UTC)

published: Sun Jun 27 2021 17:37:53 GMT+0000 (UTC)

arXiv