Interpreting Vision and Language Generative Models with Semantic Visual Priors

Michele Cafagna; Lina M. Rojas-Barahona; Kees van Deemter; Albert Gatt

When applied to image-to-text models, interpretability methods often provide token-by-token explanations; that is, they compute a visual explanation for each token of the generated sequence. Such explanations are expensive to compute and cannot comprehensively explain the model's output. As a result, these methods often require some form of approximation, which eventually leads to misleading explanations. We develop a framework based on SHAP that generates comprehensive, meaningful explanations by leveraging the meaning representation of the output sequence as a whole. Moreover, by exploiting semantic priors in the visual backbone, we extract an arbitrary number of features, which allows the efficient computation of Shapley values on large-scale models while also producing highly meaningful visual explanations. We demonstrate that our method generates semantically more expressive explanations than traditional methods at a lower compute cost, and that it can be generalized over other explainability methods.
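To make the role of Shapley values concrete, the following is a minimal sketch of exact Shapley value computation over a handful of features. It is not the authors' implementation: the function names `shapley_values` and `toy_value` are hypothetical, and `toy_value` is an assumed stand-in for a model score (for example, a similarity between the caption generated from a partially masked image and the caption generated from the full image).

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List


def shapley_values(
    n_features: int, value: Callable[[FrozenSet[int]], float]
) -> List[float]:
    """Exact Shapley values for a set function over n_features features."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Shapley weight shared by every coalition of this size.
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value(coalition | {i}) - value(coalition))
    return phi


def toy_value(visible: FrozenSet[int]) -> float:
    """Hypothetical score for a set of visible image regions (assumption)."""
    score = 0.0
    if 0 in visible:
        score += 0.5
    if 1 in visible:
        score += 0.2
    if 0 in visible and 1 in visible:
        score += 0.2  # regions 0 and 1 interact
    return score  # region 2 never changes the score


phi = shapley_values(3, toy_value)
```

Each feature requires evaluating every coalition of the remaining features, so the number of model calls grows exponentially with the feature count. This is why keeping the number of visual features small and meaningful matters for large models.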

updated: Fri Apr 28 2023 17:10:08 GMT+0000 (UTC)

published: Fri Apr 28 2023 17:10:08 GMT+0000 (UTC)

arXiv
