FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents

Guillaume Jaume; Hazim Kemal Ekenel; Jean-Philippe Thiran

We present a new dataset for form understanding in noisy scanned documents (FUNSD) that aims at extracting and structuring the textual content of forms. The dataset comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary widely in appearance, making form understanding (FoUn) a challenging task. The proposed dataset can be used for various tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking. To the best of our knowledge, this is the first publicly available dataset with comprehensive annotations to address the FoUn task. We also present a set of baselines and introduce metrics to evaluate performance on the FUNSD dataset, which can be downloaded at https://guillaumejaume.github.io/FUNSD/.

updated: Tue Oct 29 2019 15:46:39 GMT+0000 (UTC)

published: Mon May 27 2019 10:40:40 GMT+0000 (UTC)

arXiv
