Visual Question Answering based on Formal Logic

Muralikrishnna G. Sethuraman; Ali Payani; Faramarz Fekri; J. Clayton Kerce

Visual question answering (VQA) has gained considerable traction in the machine learning community in recent years because of the challenges of understanding information from multiple modalities (i.e., images and language). In VQA, a series of questions is posed about a set of images, and the task is to arrive at the answer to each question. To achieve this, we take a symbolic-reasoning-based approach using the framework of formal logic. The images and questions are converted into symbolic representations on which explicit reasoning is performed. We propose a formal logic framework in which (i) images are converted to logical background facts with the help of scene graphs, (ii) questions are translated to first-order predicate logic clauses using a transformer-based deep learning model, and (iii) satisfiability checks are performed, using the background knowledge and the grounding of predicate clauses, to obtain the answer. Our proposed method is highly interpretable, and each step in the pipeline can be easily analyzed by a human. We validate our approach on the CLEVR and GQA datasets. We achieve near-perfect accuracy of 99.6% on the CLEVR dataset, comparable to state-of-the-art models, showing that formal logic is a viable tool for tackling visual question answering. Our model is also data efficient, achieving 99.1% accuracy on the CLEVR dataset when trained on just 10% of the training data.
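
The three-step pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy scene graph, the predicate names (`red`, `cube`, `left_of`), and the helper functions `scene_graph_to_facts`, `groundings`, `exists`, and `count` are hypothetical and are not taken from the paper. The question-to-clause step, which the paper performs with a transformer-based model, is replaced here by hand-written clauses, and the satisfiability check is a brute-force enumeration of groundings rather than the authors' procedure.

```python
from itertools import product

# Hypothetical scene graph for one image (step i input).
scene_graph = {
    "objects": {
        "o1": {"color": "red", "shape": "cube"},
        "o2": {"color": "blue", "shape": "sphere"},
        "o3": {"color": "red", "shape": "sphere"},
    },
    "relations": [
        ("left_of", "o1", "o2"),
        ("left_of", "o2", "o3"),
        ("left_of", "o1", "o3"),
    ],
}


def scene_graph_to_facts(graph):
    """Step (i): flatten the scene graph into ground background facts."""
    facts = set()
    for obj_id, attrs in graph["objects"].items():
        for value in attrs.values():
            facts.add((value, obj_id))       # unary fact, e.g. red(o1)
    for name, subj, obj in graph["relations"]:
        facts.add((name, subj, obj))         # binary fact, e.g. left_of(o1, o2)
    return facts


def groundings(clause, variables, graph, facts):
    """Step (iii): yield each variable binding under which every atom
    of the conjunctive clause is a background fact."""
    object_ids = sorted(graph["objects"])
    for combo in product(object_ids, repeat=len(variables)):
        binding = dict(zip(variables, combo))
        if all((pred, *(binding[a] for a in args)) in facts
               for pred, args in clause):
            yield binding


def exists(clause, variables, graph, facts):
    """Yes/no question: is the clause satisfiable over the facts?"""
    return next(groundings(clause, variables, graph, facts), None) is not None


def count(clause, variables, target, graph, facts):
    """Counting question: number of distinct objects bound to `target`."""
    return len({b[target] for b in groundings(clause, variables, graph, facts)})


facts = scene_graph_to_facts(scene_graph)

# Step (ii) stand-in: clauses written by hand instead of predicted by a model.
# "Is there a red cube to the left of a sphere?"
clause_exists = [("red", ("X",)), ("cube", ("X",)),
                 ("sphere", ("Y",)), ("left_of", ("X", "Y"))]
answer_exists = exists(clause_exists, ["X", "Y"], scene_graph, facts)

# "How many spheres are there?"
clause_count = [("sphere", ("X",))]
answer_count = count(clause_count, ["X"], "X", scene_graph, facts)

print(answer_exists, answer_count)
```

Because the facts, the clause, and each satisfying binding are plain data, every intermediate result can be printed and inspected, which is the kind of step-by-step interpretability the abstract claims for the approach.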

updated: Mon Nov 08 2021 19:43:53 GMT+0000 (UTC)

published: Mon Nov 08 2021 19:43:53 GMT+0000 (UTC)

arXiv
