ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition

Mengqi Xue; Qihan Huang; Haofei Zhang; Lechao Cheng; Jie Song; Minghui Wu; Mingli Song

Prototypical part network (ProtoPNet) has drawn wide attention and inspired many follow-up studies because of its self-explanatory property for explainable artificial intelligence (XAI). However, when ProtoPNet is applied directly to vision transformer (ViT) backbones, the learned prototypes suffer from a "distraction" problem: they have a relatively high probability of being activated by the background and pay less attention to the foreground. The powerful capability of ViTs to model long-range dependencies makes it hard for a transformer-based ProtoPNet to focus on prototypical parts, which severely impairs its inherent interpretability. This paper proposes the prototypical part transformer (ProtoPFormer) to apply the prototype-based method to ViTs appropriately and effectively for interpretable image recognition. Following the architectural characteristics of ViTs, the proposed method introduces global and local prototypes that capture and highlight the representative holistic and partial features of targets. The global prototypes provide a global view of the object, guiding the local prototypes to concentrate on the foreground while eliminating the influence of the background. The local prototypes are then explicitly supervised to concentrate on their respective prototypical visual parts, which increases overall interpretability. Extensive experiments demonstrate that the proposed global and local prototypes can mutually correct each other and jointly make final decisions, faithfully and transparently reasoning about the decision-making process from the holistic and local perspectives, respectively. Moreover, ProtoPFormer consistently achieves superior performance and visualization results compared with state-of-the-art (SOTA) prototype-based baselines. Our code has been released at https://github.com/zju-vipa/ProtoPFormer.
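The two-branch idea in the abstract can be illustrated with a toy sketch. This is not the released ProtoPFormer code, and the abstract does not specify the details, so everything here is an assumption made for illustration: the function names `cosine` and `prototype_scores`, the use of cosine similarity, the `keep_ratio` parameter, and the rule that selects "foreground" patches by their similarity to the class token. The sketch only shows the general shape: a global branch scores the class token against global prototypes, and a local branch scores the retained patch tokens against local prototypes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prototype_scores(class_token, patch_tokens, global_protos, local_protos,
                     keep_ratio=0.5):
    """Toy two-branch prototype scoring (hypothetical, for illustration only)."""
    # Global branch: compare the class token with every global prototype.
    global_sims = [cosine(class_token, p) for p in global_protos]

    # Assumed foreground rule: keep the patches most similar to the class token.
    ranked = sorted(range(len(patch_tokens)),
                    key=lambda i: cosine(class_token, patch_tokens[i]),
                    reverse=True)
    n_keep = max(1, int(len(patch_tokens) * keep_ratio))
    foreground = sorted(ranked[:n_keep])

    # Local branch: each local prototype takes its best match among the
    # retained patches; the index of that patch is what would be visualized.
    local_sims, local_argmax = [], []
    for proto in local_protos:
        best_sim, best_idx = max((cosine(patch_tokens[i], proto), i)
                                 for i in foreground)
        local_sims.append(best_sim)
        local_argmax.append(best_idx)

    return global_sims, local_sims, local_argmax, foreground


# Tiny made-up example: 2-D tokens, four patches, one global and two local prototypes.
class_token = [1.0, 0.0]
patch_tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
g_sims, l_sims, l_idx, fg = prototype_scores(
    class_token, patch_tokens,
    global_protos=[[1.0, 0.0]],
    local_protos=[[1.0, 0.0], [0.0, 1.0]],
)
print(fg, l_idx)
```

In this toy input, patches 1 and 3 are discarded before the local prototypes are matched, so a local prototype cannot be activated by them. That is the effect the abstract attributes to the global view; how the actual method computes the foreground and supervises the local prototypes is described in the paper, not here.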

updated: Mon Sep 26 2022 16:18:27 GMT+0000 (UTC)

published: Mon Aug 22 2022 16:36:32 GMT+0000 (UTC)

arXiv

