Interpretable Visual Understanding with Cognitive Attention Network

Xuejiao Tang; Wenbin Zhang; Yi Yu; Kea Turner; Tyler Derr; Mengyu Wang; Eirini Ntoutsi

While recognition-level image understanding has advanced remarkably, reliable visual scene understanding requires comprehensive image understanding not only at the recognition level but also at the cognition level. This calls for exploiting multi-source information as well as learning different levels of understanding and extensive commonsense knowledge. In this paper, we propose a novel Cognitive Attention Network (CAN) for visual commonsense reasoning to achieve interpretable visual understanding. Specifically, we first introduce an image-text fusion module that fuses information from images and text jointly. Second, we design a novel inference module to encode commonsense among the image, query, and response. Extensive experiments on the large-scale Visual Commonsense Reasoning (VCR) benchmark dataset demonstrate the effectiveness of our approach. The implementation is publicly available at https://github.com/tanjatang/CAN
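To make the idea of fusing image and text information concrete, here is a minimal sketch of one generic way to do it: each text token attends over image-region features with scaled dot-product attention, and the attended image summary is concatenated to the token. This is an illustrative assumption, not the CAN architecture itself; the function names (`fuse`, `softmax`, `dot`) and the toy feature vectors are hypothetical, and the authors' actual modules are in the repository linked above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse(text_tokens, image_regions):
    """Hypothetical fusion step (not the authors' module).

    For each text token, compute attention weights over the image
    regions, build the weighted image summary, and concatenate it
    to the token vector.
    """
    dim = len(image_regions[0])
    fused = []
    for token in text_tokens:
        # Scaled dot-product similarity between the token and each region.
        scores = [dot(token, region) / math.sqrt(dim) for region in image_regions]
        weights = softmax(scores)
        # Attention-weighted average of the region features.
        context = [
            sum(w * region[i] for w, region in zip(weights, image_regions))
            for i in range(dim)
        ]
        fused.append(token + context)  # list concatenation: [token ; context]
    return fused


# Toy inputs: two 2-d text tokens and three 2-d image regions (made-up values).
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse(text_tokens, image_regions)
```

Each fused vector has twice the input dimension: the first half is the original token and the second half summarizes the image regions most similar to it. A downstream inference step could then score candidate responses against such fused representations.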

updated: Fri Aug 06 2021 02:57:43 GMT+0000 (UTC)

published: Fri Aug 06 2021 02:57:43 GMT+0000 (UTC)

arXiv
