Towards Complex Document Understanding By Discrete Reasoning

Fengbin Zhu; Wenqiang Lei; Fuli Feng; Chao Wang; Haozhou Zhang; Tat-Seng Chua

Document Visual Question Answering (VQA) aims to understand visually-rich documents in order to answer questions in natural language, and is an emerging research topic in both Natural Language Processing and Computer Vision. In this work, we introduce a new Document VQA dataset, named TAT-DQA, built by extending the TAT-QA dataset. It consists of 3,067 document pages comprising semi-structured table(s) and unstructured text, together with 16,558 question-answer pairs. These documents are sampled from real-world financial reports and contain many numbers, so answering questions on this dataset demands discrete reasoning capability. Based on TAT-DQA, we further develop a novel model named MHST that takes into account multimodal information, including text, layout and visual image, to address different types of questions with corresponding strategies, i.e., extraction or reasoning. Extensive experiments show that the MHST model significantly outperforms the baseline methods, demonstrating its effectiveness. However, its performance still lags far behind that of expert humans. We expect that the new TAT-DQA dataset will facilitate research on deep understanding of visually-rich documents combining vision and language, especially for scenarios that require discrete reasoning. We also hope the proposed model will inspire researchers to design more advanced Document VQA models in the future.
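
The distinction between the two answering strategies named above, extraction and reasoning, can be illustrated with a toy sketch. This is not the MHST model and does not use TAT-DQA data: the table, the sentence, the figures, and both function names are hypothetical, chosen only to show that an extractive answer is copied from the page while a discrete-reasoning answer is computed from numbers found on it.

```python
from typing import Dict

# Hypothetical toy "document page": one table row plus one sentence.
# These figures are invented for illustration and are not from TAT-DQA.
table: Dict[str, Dict[str, float]] = {
    "Revenue": {"2019": 120.0, "2018": 100.0},
}
text = "Revenue is reported in millions of USD."


def answer_by_extraction(passage: str, start: int, end: int) -> str:
    """Extraction: the answer is a span copied verbatim from the document."""
    return passage[start:end]


def answer_by_reasoning(row: Dict[str, float], new: str, old: str) -> float:
    """Discrete reasoning: the answer is computed from extracted numbers.

    Here the computation is a percentage change between two table cells.
    """
    return 100.0 * (row[new] - row[old]) / row[old]


# "In what unit is revenue reported?" -> answered by copying a text span.
unit = answer_by_extraction(text, text.index("millions"), len(text) - 1)

# "What is the percentage change in revenue from 2018 to 2019?"
# -> the answer appears nowhere on the page and must be calculated.
change = answer_by_reasoning(table["Revenue"], "2019", "2018")

print(unit)
print(change)
```

In this sketch the second question cannot be answered by any span selection, which is the sense in which number-heavy financial documents call for discrete reasoning in addition to extraction.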

updated: Mon Jul 25 2022 01:43:19 GMT+0000 (UTC)

published: Mon Jul 25 2022 01:43:19 GMT+0000 (UTC)

arXiv
