Explanation vs Attention: A Two-Player Game to Obtain Attention for VQA

Badri N. Patro; Anupriy; Vinay P. Namboodiri

In this paper, we aim to obtain improved attention for a visual question answering (VQA) task. Providing supervision for attention is challenging. We observe that visual explanations obtained through class activation mappings (specifically Grad-CAM), which are meant to explain the performance of various networks, could serve as a means of supervision. However, because the distributions of attention maps and of Grad-CAM maps differ, using the latter directly as supervision is not suitable. Instead, we propose a discriminator that aims to distinguish samples of visual explanations from samples of attention maps. Adversarial training of the attention regions, framed as a two-player game between attention and explanation, brings the distributions of attention maps and visual explanations closer. Significantly, we observe that this form of supervision also yields attention maps that are more closely related to human attention, resulting in a substantial improvement over baseline stacked attention network (SAN) models. It also improves the rank correlation metric on the VQA task. The method can be combined with recent MCB-based methods and yields consistent improvement. We also compare against other means of learning distributions, based on Correlation Alignment (Coral), Maximum Mean Discrepancy (MMD), and Mean Square Error (MSE) losses, and observe that the adversarial loss outperforms these alternatives for learning the attention maps. Visualization of the results also confirms our hypothesis that attention maps improve under this form of supervision.
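
The two-player game described in the abstract can be illustrated with a toy calculation. The abstract does not state the exact objective, so this minimal sketch assumes the standard GAN losses (non-saturating form for the attention side). The linear discriminator, its weights, and the 4-region maps are hypothetical values chosen only for illustration, not the authors' architecture or data.

```python
import math

def softmax(scores):
    """Normalize raw scores into a probability distribution over image regions."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def discriminator_prob(region_map, weights, bias):
    """Hypothetical linear discriminator: probability that a map is an explanation."""
    logit = sum(w * p for w, p in zip(weights, region_map)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def discriminator_loss(d_explanation, d_attention):
    """The discriminator wants explanation maps scored 1 and attention maps scored 0."""
    return -(math.log(d_explanation) + math.log(1.0 - d_attention))

def attention_loss(d_attention):
    """The attention side wants its map to be scored as an explanation (assumed form)."""
    return -math.log(d_attention)

# Toy maps over 4 regions (assumed values): attention peaks on region 2,
# the Grad-CAM style explanation peaks on region 0.
attention_map = softmax([0.1, 0.2, 2.0, 0.3])
explanation_map = softmax([2.0, 0.1, 0.3, 0.2])

# Hypothetical discriminator parameters that favour mass on region 0.
weights, bias = [3.0, 0.0, -3.0, 0.0], 0.0

d_exp = discriminator_prob(explanation_map, weights, bias)
d_att = discriminator_prob(attention_map, weights, bias)

loss_d = discriminator_loss(d_exp, d_att)
loss_att = attention_loss(d_att)
print(f"D(explanation)={d_exp:.3f}  D(attention)={d_att:.3f}")
print(f"discriminator loss={loss_d:.3f}  attention loss={loss_att:.3f}")
```

In this sketch the attention loss falls as the attention map moves its mass toward the regions the explanation highlights, which is the sense in which the game pulls the two distributions closer. In a real model both players would be neural networks updated alternately by gradient descent.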

updated: Tue Nov 19 2019 22:30:13 GMT+0000 (UTC)

published: Tue Nov 19 2019 22:30:13 GMT+0000 (UTC)

arXiv

