DocVQA: A Dataset for VQA on Document Images

Minesh Mathew; Dimosthenis Karatzas; C. V. Jawahar

We present DocVQA, a new dataset for Visual Question Answering (VQA) on document images. The dataset consists of 50,000 questions defined on 12,000+ document images. We present a detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding the structure of the document is crucial. The dataset, code, and leaderboard are available at http://cvit.iiit.ac.in/docvqa/.

updated: Sun Dec 13 2020 04:13:51 GMT+0000 (UTC)

published: Wed Jul 01 2020 11:37:40 GMT+0000 (UTC)

arXiv
