DocFormerv2: Local Features for Document Understanding

Srikar Appalaraju; Peng Tang; Qi Dong; Nishant Sankaran; Yichu Zhou; R. Manmatha

We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU). The VDU domain entails understanding documents beyond mere OCR predictions, e.g., extracting information from a form, visual question answering (VQA) for documents, and other tasks. VDU is challenging because a model must make sense of multiple modalities (visual, language, and spatial) to make a prediction. Our approach, termed DocFormerv2, is an encoder-decoder transformer that takes vision, language, and spatial features as input. DocFormerv2 is pre-trained with unsupervised tasks employed asymmetrically, i.e., two novel document tasks on the encoder and one on the auto-regressive decoder. The unsupervised tasks have been carefully designed so that pre-training encourages local-feature alignment between multiple modalities. Evaluated on nine datasets, DocFormerv2 shows state-of-the-art performance over strong baselines, e.g., TabFact (4.3%), InfoVQA (1.4%), and FUNSD (1%). Furthermore, to show generalization capabilities, on three VQA tasks involving scene text, DocFormerv2 outperforms previous comparably sized models and even does better than much larger models (such as GIT2, PaLi, and Flamingo) on some tasks. Extensive ablations show that, due to its pre-training, DocFormerv2 understands multiple modalities better than prior art in VDU.

updated: Fri Jun 02 2023 17:58:03 GMT+0000 (UTC)

published: Fri Jun 02 2023 17:58:03 GMT+0000 (UTC)

arXiv
