DocParser: End-to-end OCR-free Information Extraction from Visually Rich Documents

Mohamed Dhouib; Ghassen Bettaieb; Aymen Shabou

Information extraction from visually rich documents is a challenging task that has gained a lot of attention in recent years due to its importance in several document-control-based applications and its widespread commercial value. The majority of the research conducted on this topic to date follows a two-step pipeline: first, the text is read using an off-the-shelf Optical Character Recognition (OCR) engine; then, the fields of interest are extracted from the obtained text. The main drawback of these approaches is their dependence on an external OCR system, which can negatively impact both performance and computational speed. OCR-free methods have recently been proposed to address these issues. Inspired by their promising results, we propose in this paper an OCR-free end-to-end information extraction model named DocParser. It differs from prior end-to-end approaches in its ability to better extract discriminative character features. DocParser achieves state-of-the-art results on various datasets while still being faster than previous works.

updated: Mon May 01 2023 21:09:08 GMT+0000 (UTC)

published: Mon Apr 24 2023 22:48:29 GMT+0000 (UTC)

arXiv
