Heterogeneous Graph Learning for Visual Commonsense Reasoning

Weijiang Yu; Jingwen Zhou; Weihao Yu; Xiaodan Liang; Nong Xiao

The visual commonsense reasoning task aims to push the field toward cognition-level reasoning: a model must predict the correct answer and also provide a convincing reasoning path, which yields three sub-tasks, Q->A, QA->R and Q->AR. The task poses great challenges for achieving proper semantic alignment between the vision and language domains and for knowledge reasoning that generates persuasive reasoning paths. Existing works either resort to a powerful end-to-end network that cannot produce interpretable reasoning paths, or explore only the intra-relationships of visual objects (a homogeneous graph) while ignoring the cross-domain semantic alignment between visual concepts and linguistic words. In this paper, we propose a new Heterogeneous Graph Learning (HGL) framework that seamlessly integrates intra-graph and inter-graph reasoning in order to bridge the vision and language domains. HGL consists of a primal vision-to-answer heterogeneous graph (VAHG) module and a dual question-to-answer heterogeneous graph (QAHG) module, which interactively refine reasoning paths for semantic agreement. Moreover, HGL integrates a contextual voting module that exploits long-range visual context for better global reasoning. Experiments on the large-scale Visual Commonsense Reasoning benchmark demonstrate the superior performance of the proposed modules on all three tasks (improving accuracy by 5% on Q->A, 3.5% on QA->R, and 5.8% on Q->AR).
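To make the idea of inter-graph reasoning between two node types more concrete, the following is a minimal sketch of one cross-domain message-passing step, in which answer-word nodes collect evidence from visual-object nodes. It is not the authors' implementation: the abstract does not specify how edges are computed, so the dot-product similarity, the softmax normalization, the residual update, and the names `softmax` and `cross_domain_step` are all assumptions chosen for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_domain_step(vision_nodes, answer_nodes):
    """One hypothetical vision-to-answer message-passing step.

    vision_nodes: list of feature vectors, one per visual object.
    answer_nodes: list of feature vectors, one per answer word.
    Returns (adjacency, updated_answer_nodes), where adjacency[i][j]
    is the edge weight from answer word i to visual object j.
    """
    adjacency = []
    updated = []
    for a in answer_nodes:
        # Assumed edge score: dot product between word and object features.
        scores = [sum(x * y for x, y in zip(a, v)) for v in vision_nodes]
        weights = softmax(scores)  # each row of the adjacency sums to 1
        adjacency.append(weights)
        # Aggregate visual features, weighted by edge strength.
        message = [
            sum(w * v[d] for w, v in zip(weights, vision_nodes))
            for d in range(len(a))
        ]
        # Assumed residual update: keep the word feature, add the evidence.
        updated.append([ai + mi for ai, mi in zip(a, message)])
    return adjacency, updated


# Toy example: two visual objects, one answer word, 2-d features.
vision = [[1.0, 0.0], [0.0, 1.0]]
answer = [[2.0, 0.0]]
adj, out = cross_domain_step(vision, answer)
```

In this toy input the answer word is more similar to the first object, so the first edge weight is the larger one. Under the same assumptions, a question-to-answer step would have the same shape with question-word features in place of the visual features.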

updated: Fri Oct 25 2019 01:04:46 GMT+0000 (UTC)

published: Fri Oct 25 2019 01:04:46 GMT+0000 (UTC)

arXiv
