Towards Complex Document Understanding By Discrete Reasoning

Fengbin Zhu; Wenqiang Lei; Fuli Feng; Chao Wang; Haozhou Zhang; Tat-Seng Chua

Document Visual Question Answering (VQA) aims to understand visually-rich documents in order to answer questions posed in natural language, and is an emerging research topic in both Natural Language Processing and Computer Vision. In this work, we introduce a new Document VQA dataset, named TAT-DQA, built by extending the TAT-QA dataset. It consists of 3,067 document pages containing semi-structured table(s) and unstructured text, together with 16,558 question-answer pairs. The documents are sampled from real-world financial reports and contain many numbers, so answering questions on this dataset demands discrete reasoning capability. Based on TAT-DQA, we further develop a novel model named MHST that takes into account information from multiple modalities, including text, layout and visual image, to address different types of questions with corresponding strategies, i.e., extraction or reasoning. Extensive experiments show that the MHST model significantly outperforms the baseline methods, demonstrating its effectiveness. However, its performance still lags far behind that of expert humans. We expect that our new TAT-DQA dataset will facilitate research on deep understanding of visually-rich documents combining vision and language, especially for scenarios that require discrete reasoning. We also hope the proposed model will inspire researchers to design more advanced Document VQA models in the future. Our dataset will be publicly available for non-commercial use at https://nextplusplus.github.io/TAT-DQA/.
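
To make the distinction between the two answering strategies named in the abstract (extraction versus reasoning) concrete, the following is a minimal illustrative sketch. It is not the MHST model and does not use the TAT-DQA data: the table values, the function names, and the percentage-change question are all hypothetical, chosen only to show that some questions can be answered by reading a value off the page while others require arithmetic over several extracted numbers.

```python
# Hypothetical toy data: two cells of a financial table.
# The numbers are invented for illustration and do not come from TAT-DQA.
TABLE = {
    ("Revenue", "2019"): 120.0,
    ("Revenue", "2018"): 100.0,
}


def answer_by_extraction(row, year):
    """Extraction: the answer is a value that appears verbatim in the document."""
    return TABLE[(row, year)]


def answer_by_reasoning(row, new_year, old_year):
    """Discrete reasoning: the answer is computed from several extracted numbers.

    Here the (assumed) question is "What is the percentage change in <row>
    from <old_year> to <new_year>?".
    """
    new_value = TABLE[(row, new_year)]
    old_value = TABLE[(row, old_year)]
    return round((new_value - old_value) / old_value * 100, 2)


if __name__ == "__main__":
    print(answer_by_extraction("Revenue", "2019"))
    print(answer_by_reasoning("Revenue", "2019", "2018"))
```

In this sketch the choice of strategy is made by the caller; a model of the kind the abstract describes would have to decide from the question and the page which strategy applies.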

updated: Wed Sep 07 2022 14:36:01 GMT+0000 (UTC)

published: Mon Jul 25 2022 01:43:19 GMT+0000 (UTC)

arXiv
