Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question Answering

Rajat Koner; Hang Li; Marcel Hildebrandt; Deepan Das; Volker Tresp; Stephan Günnemann

Visual Question Answering (VQA) is the task of answering free-form questions about an image. It is an ambitious task: it requires a deep semantic and linguistic understanding of the question, the ability to associate the question with the objects present in the image, and hence multi-modal reasoning that draws on both computer vision and natural language processing. We propose Graphhopper, a novel method that approaches the task by integrating knowledge graph reasoning, computer vision, and natural language processing techniques. Concretely, our method performs context-driven, sequential reasoning over the scene entities and their semantic and spatial relationships. As a first step, we derive a scene graph that describes the objects in the image, their attributes, and their mutual relationships. Subsequently, a reinforcement learning agent is trained to navigate autonomously over the extracted scene graph in a multi-hop manner and to generate reasoning paths, which are the basis for deriving answers. We conduct an experimental study on the challenging GQA dataset, using both manually curated and automatically generated scene graphs. Our results show that Graphhopper keeps up with human performance on manually curated scene graphs. Moreover, Graphhopper outperforms another state-of-the-art scene graph reasoning model by a significant margin on both manually curated and automatically generated scene graphs.
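To make the idea of multi-hop navigation over a scene graph concrete, the following is a minimal sketch and not the authors' implementation. It assumes a scene graph stored as an adjacency list in which attributes are modeled as edges to attribute nodes, a starting node that is given in advance, and a hypothetical `keyword_policy` that stands in for the trained reinforcement learning agent by simply preferring edges whose words appear in the question. All names in the block (`SceneGraph`, `keyword_policy`, `rollout`, the example graph) are illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

# Scene graph as an adjacency list: node -> list of (relation, neighbor).
# Modeling attributes as edges to attribute nodes is an assumption of this sketch.
Action = Tuple[str, str]
SceneGraph = Dict[str, List[Action]]
Policy = Callable[[str, str, List[Action]], Action]


def keyword_policy(question: str, node: str, actions: List[Action]) -> Action:
    """Hypothetical stand-in for a learned policy.

    Scores each outgoing edge by how many of its words occur in the
    question and returns the best one (first one on ties).
    """
    words = set(question.lower().replace("?", "").split())

    def score(action: Action) -> int:
        relation, target = action
        return sum(tok in words for tok in relation.split("_") + [target])

    return max(actions, key=score)


def rollout(graph: SceneGraph, question: str, start: str,
            policy: Policy, max_hops: int = 3) -> List[str]:
    """Walk the graph for at most max_hops steps and return the path taken.

    The path alternates nodes and relations: [node, relation, node, ...].
    """
    path = [start]
    node = start
    for _ in range(max_hops):
        actions = graph.get(node, [])
        if not actions:  # dead end: no outgoing edges
            break
        relation, node = policy(question, node, actions)
        path += [relation, node]
    return path


# Toy scene graph (invented for illustration, not taken from GQA).
graph: SceneGraph = {
    "man": [("wearing", "hat"), ("holding", "cup")],
    "hat": [("has_color", "red")],
    "cup": [("has_color", "white")],
}
question = "What color is the hat the man is wearing?"

path = rollout(graph, question, "man", keyword_policy)
answer = path[-1]  # in this sketch, the last node on the path is the answer
print(" -> ".join(path))
```

In this sketch the reasoning path doubles as an explanation of the answer, since every hop names the relation that was followed. A learned agent would replace `keyword_policy` with a policy conditioned on an encoding of the question and trained from a reward signal.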

updated: Tue Jul 13 2021 18:33:04 GMT+0000 (UTC)

published: Tue Jul 13 2021 18:33:04 GMT+0000 (UTC)

arXiv
