GraphVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering

Weixin Liang; Yanhao Jiang; Zixuan Liu

Images are more than a collection of objects or attributes: they represent a web of relationships among interconnected objects. The scene graph has emerged as a new modality for a structured graphical representation of images. A scene graph encodes objects as nodes, connected by edges that represent pairwise relations. To support question answering on scene graphs, we propose GraphVQA, a language-guided graph neural network framework that translates and executes a natural language question as multiple iterations of message passing among graph nodes. We explore the design space of the GraphVQA framework and discuss the trade-offs of different design choices. Our experiments on the GQA dataset show that GraphVQA outperforms the state-of-the-art model by a large margin (94.78% vs. 88.43%).
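To make the idea of executing a question as iterations of message passing concrete, here is a toy sketch in plain Python. It is not the authors' architecture: GraphVQA uses learned graph neural network layers and a learned question encoding, whereas this sketch assumes hand-set scalar weights per relation as a stand-in for language guidance. The names `SceneGraph`, `message_passing_step`, and `relation_weights`, and the example objects, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SceneGraph:
    # One feature vector per object (toy embeddings, assumed for illustration).
    nodes: Dict[str, List[float]]
    # Directed edges as (subject, relation, object) triples.
    edges: List[Tuple[str, str, str]] = field(default_factory=list)


def message_passing_step(graph: SceneGraph,
                         relation_weights: Dict[str, float]) -> SceneGraph:
    """Run one round of message passing over the scene graph.

    Each object node adds the features of its subject neighbors, scaled by a
    per-relation weight. The weight is a hypothetical stand-in for how much
    the question emphasizes that relation; relations without a weight are
    ignored.
    """
    updated = {name: list(vec) for name, vec in graph.nodes.items()}
    for subj, rel, obj in graph.edges:
        weight = relation_weights.get(rel, 0.0)
        for i, value in enumerate(graph.nodes[subj]):
            updated[obj][i] += weight * value
    return SceneGraph(updated, graph.edges)


# Toy scene: a cat on a mat, and the mat on the floor.
graph = SceneGraph(
    nodes={"cat": [1.0, 0.0], "mat": [0.0, 1.0], "floor": [0.0, 0.0]},
    edges=[("cat", "on", "mat"), ("mat", "on", "floor")],
)

# A question such as "What is the cat on?" is assumed to emphasize "on".
weights = {"on": 1.0}

# Two iterations let information travel two hops: cat -> mat -> floor.
step1 = message_passing_step(graph, weights)
step2 = message_passing_step(step1, weights)
```

After one step only `mat` has received the `cat` features; after the second step they have also reached `floor`, which illustrates why multi-hop questions need multiple iterations.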

updated: Wed Jun 02 2021 05:29:00 GMT+0000 (UTC)

published: Tue Apr 20 2021 23:54:41 GMT+0000 (UTC)

arXiv

