Check It Again: Progressive Visual Question Answering via Visual Entailment

Qingyi Si; Zheng Lin; Mingyu Zheng; Peng Fu; Weiping Wang

While sophisticated Visual Question Answering models have achieved remarkable success, they tend to answer questions only according to superficial correlations between question and answer. Several recent approaches have been developed to address this language priors problem. However, most of them predict the correct answer according to one best output without checking the authenticity of answers. Besides, they only explore the interaction between image and question, ignoring the semantics of candidate answers. In this paper, we propose a select-and-rerank (SAR) progressive framework based on Visual Entailment. Specifically, we first select the candidate answers relevant to the question or the image, then we rerank the candidate answers by a visual entailment task, which verifies whether the image semantically entails the synthetic statement of the question and each candidate answer. Experimental results show the effectiveness of our proposed framework, which establishes a new state-of-the-art accuracy on VQA-CP v2 with a 7.55% improvement.
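The two-stage idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `top_n` parameter, the naive question-plus-answer statement synthesis, and the toy entailment scorer are all assumptions made for the example. In practice the first stage would use scores from a trained VQA model and the second stage a trained visual entailment model.

```python
from typing import Callable, Dict, List, Set, Tuple


def select_candidates(answer_scores: Dict[str, float], top_n: int) -> List[str]:
    """Stage 1 (select): keep the top_n answers by base-model score."""
    ranked = sorted(answer_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [answer for answer, _ in ranked[:top_n]]


def synthesize_statement(question: str, answer: str) -> str:
    """Hypothetical synthesis: append the answer to the question.
    A real system might instead rewrite the pair as a declarative sentence."""
    return f"{question} {answer}"


def rerank(
    image: Set[str],
    question: str,
    candidates: List[str],
    entail_score: Callable[[Set[str], str], float],
) -> List[Tuple[str, float]]:
    """Stage 2 (rerank): score each candidate by how strongly the image
    entails the statement built from the question and that candidate."""
    scored = [
        (answer, entail_score(image, synthesize_statement(question, answer)))
        for answer in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def select_and_rerank(
    image: Set[str],
    question: str,
    answer_scores: Dict[str, float],
    entail_score: Callable[[Set[str], str], float],
    top_n: int = 2,
) -> str:
    """Run both stages and return the best answer after reranking."""
    candidates = select_candidates(answer_scores, top_n)
    return rerank(image, question, candidates, entail_score)[0][0]


def toy_entail(image_tags: Set[str], statement: str) -> float:
    """Toy stand-in for a visual entailment model (assumption): the image is
    a set of tags, and the statement is entailed if its last word is a tag."""
    return float(statement.split()[-1] in image_tags)


# Toy data: the base scores mimic a language prior that favours "yellow"
# for banana questions, while the image actually shows a green banana.
base_scores = {"yellow": 0.6, "green": 0.3, "red": 0.1}
image = {"banana", "green"}
print(select_and_rerank(image, "What color is the banana?", base_scores, toy_entail))
# prints "green"
```

In this toy run the base scores alone would pick "yellow", but because "green" survives the selection stage, the entailment check can promote it. That is the intuition behind checking candidates again instead of trusting the single best output.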

updated: Tue Jun 08 2021 18:00:38 GMT+0000 (UTC)

published: Tue Jun 08 2021 18:00:38 GMT+0000 (UTC)

arXiv
