VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

Estelle Aflalo; Meng Du; Shao-Yen Tseng; Yongfei Liu; Chenfei Wu; Nan Duan; Vasudev Lal

Breakthroughs in transformer-based models have revolutionized not only the NLP field but also vision and multimodal systems. However, while visualization and interpretability tools have become available for NLP models, the internal mechanisms of vision and multimodal transformers remain largely opaque. Given the success of these transformers, it is increasingly critical to understand their inner workings, as unraveling these black boxes will lead to more capable and trustworthy models. To contribute to this quest, we propose VL-InterpreT, which provides novel interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers. VL-InterpreT is a task-agnostic, integrated tool that (1) tracks a variety of statistics in attention heads throughout all layers for both vision and language components, (2) visualizes cross-modal and intra-modal attentions through easily readable heatmaps, and (3) plots the hidden representations of vision and language tokens as they pass through the transformer layers. In this paper, we demonstrate the functionalities of VL-InterpreT through the analysis of KD-VLP, an end-to-end pretraining vision-language multimodal transformer-based model, on Visual Commonsense Reasoning (VCR) and WebQA, two visual question answering benchmarks. We also present a few interesting findings about multimodal transformer behaviors that were learned through our tool.
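To make point (1) concrete, the following is a minimal sketch of one kind of attention statistic such a tool could track: the mean attention weight between modalities within a single head. It is not VL-InterpreT's actual API. The function name `modality_attention_means`, the plain nested-list matrix, and the assumption that text tokens precede image tokens in the sequence are all hypothetical choices made for illustration.

```python
from typing import Dict, List

# attention[i][j] is the weight that query token i places on key token j.
Matrix = List[List[float]]


def modality_attention_means(attention: Matrix, num_text_tokens: int) -> Dict[str, float]:
    """Average attention weight for each (query modality -> key modality) pair.

    Assumption (hypothetical layout): the first `num_text_tokens` positions are
    text tokens and all remaining positions are image tokens.
    """
    n = len(attention)
    if any(len(row) != n for row in attention):
        raise ValueError("attention must be a square matrix")
    if not 0 < num_text_tokens < n:
        raise ValueError("need at least one text token and one image token")

    def modality(index: int) -> str:
        return "text" if index < num_text_tokens else "image"

    sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for i, row in enumerate(attention):
        for j, weight in enumerate(row):
            key = f"{modality(i)}->{modality(j)}"
            sums[key] = sums.get(key, 0.0) + weight
            counts[key] = counts.get(key, 0) + 1

    # Divide each accumulated sum by the number of entries in that block.
    return {key: sums[key] / counts[key] for key in sums}


# Toy attention matrix for one head: two text tokens followed by one image token.
toy_attention = [
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
]
stats = modality_attention_means(toy_attention, num_text_tokens=2)
for pair, mean_weight in sorted(stats.items()):
    print(f"{pair}: {mean_weight:.3f}")
```

Applying the same function to every head in every layer would give a per-head table of intra-modal ("text->text", "image->image") and cross-modal ("text->image", "image->text") averages, which is the kind of summary that a heatmap could then display.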

updated: Mon Aug 22 2022 22:25:59 GMT+0000 (UTC)

published: Wed Mar 30 2022 05:25:35 GMT+0000 (UTC)

arXiv
