VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning

Kang Chen; Xiangqian Wu

The ideal form of Visual Question Answering (VQA) requires understanding, grounding, and reasoning in the joint space of vision and language, and serves as a proxy for the AI task of scene understanding. However, most existing VQA benchmarks are limited to picking the answer from a pre-defined set of options and pay little attention to text. We present a new challenge with a dataset that contains 23,781 questions based on 10,124 image-text pairs. Specifically, the task requires the model to align multimedia representations of the same entity, perform multi-hop reasoning between image and text, and finally answer the question in natural language. The aim of this challenge is to develop and benchmark models that are capable of multimedia entity alignment, multi-step reasoning, and open-ended answer generation.

updated: Sun Mar 05 2023 10:32:26 GMT+0000 (UTC)

published: Sun Mar 05 2023 10:32:26 GMT+0000 (UTC)

arXiv
