COIN: Counterfactual Image Generation for VQA Interpretation

Zeyd Boukhers; Timo Hartmann; Jan Jürjens

Owing to significant advances in Natural Language Processing and Computer Vision models, Visual Question Answering (VQA) systems are becoming more capable. However, they are still error-prone when dealing with relatively complex questions. It is therefore important to understand the behaviour of VQA models before adopting their results. In this paper, we introduce an interpretability approach for VQA models based on generating counterfactual images. Specifically, the generated image should differ as little as possible from the original image while leading the VQA model to give a different answer. In addition, our approach ensures that the generated image is realistic. Since quantitative metrics cannot be employed to evaluate the interpretability of the model, we carried out a user study to assess different aspects of our approach. In addition to interpreting the results of VQA models on single images, the obtained results and the discussion provide an extensive explanation of VQA models' behaviour.

updated: Mon Jan 10 2022 13:51:35 GMT+0000 (UTC)

published: Mon Jan 10 2022 13:51:35 GMT+0000 (UTC)

arXiv
