Attention Mechanism based Cognition-level Scene Understanding

Xuejiao Tang; Tai Le Quy; Eirini Ntoutsi; Kea Turner; Vasile Palade; Israat Haque; Peng Xu; Chris Brown; Wenbin Zhang

Given a question-image input, a Visual Commonsense Reasoning (VCR) model predicts an answer together with the corresponding rationale, which requires real-world inference ability. The VCR task is a cognition-level scene understanding task: it calls for exploiting multi-source information, learning different levels of understanding, and drawing on extensive commonsense knowledge. The task has attracted researchers' interest because of its wide range of applications, including visual question answering, automated vehicle systems, and clinical decision support. Previous approaches to the VCR task generally rely on pre-training or on exploiting memory with models that encode long-range dependencies. However, these approaches suffer from a lack of generalizability and from information loss in long sequences. In this paper, we propose PAVCR, a parallel attention-based cognitive VCR network that fuses visual-textual information efficiently and encodes semantic information in parallel, enabling the model to capture rich information for cognition-level inference. Extensive experiments show that the proposed model yields significant improvements over existing methods on the benchmark VCR dataset. Moreover, the proposed model provides an intuitive interpretation of visual commonsense reasoning.
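The abstract does not specify how PAVCR implements its attention, so the following is only a minimal sketch of generic scaled dot-product cross-attention, a common way to fuse textual and visual features. It is not the authors' architecture. The function names, the toy feature vectors, and the choice of text tokens as queries over image regions are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product attention (illustrative, not PAVCR itself).

    queries: list of d-dimensional vectors (assumed here: text token features)
    keys, values: lists of vectors (assumed here: image region features)
    Returns (fused vectors, attention weights), one row per query.
    """
    d = len(keys[0])
    fused, weights = [], []
    # Each query is processed independently of the others, so the rows
    # could be computed in parallel in a batched implementation.
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out = [
            sum(wj * v[i] for wj, v in zip(w, values))
            for i in range(len(values[0]))
        ]
        fused.append(out)
        weights.append(w)
    return fused, weights


# Toy, made-up features: two text tokens attending over three image regions.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = cross_attention(text_tokens, image_regions, image_regions)
```

Each row of `weights` sums to 1 and shows how strongly one text token attends to each image region. Inspecting such weights is one generic way attention models are examined, though whether PAVCR's interpretability works this way is not stated in the abstract.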

updated: Sun Apr 17 2022 15:04:44 GMT+0000 (UTC)

published: Sun Apr 17 2022 15:04:44 GMT+0000 (UTC)

arXiv
