SIMARA: a database for key-value information extraction from full pages

Solène Tarride; Mélodie Boillet; Jean-François Moufflet; Christopher Kermorvant

We propose a new database for information extraction from historical handwritten documents. The corpus includes 5,393 finding aids from six different series, dating from the 18th to the 20th century. Finding aids are handwritten documents that contain metadata describing older archives. They are stored in the National Archives of France and are used by archivists to identify and locate archival documents. Each document is annotated at page level and contains seven fields to retrieve. The localization of each field is not provided, so this dataset encourages research on segmentation-free systems for information extraction. We propose a model based on the Transformer architecture trained for end-to-end information extraction and provide three sets for training, validation, and testing to ensure fair comparison with future work. The database is freely accessible at https://zenodo.org/record/7868059.

updated: Wed Apr 26 2023 15:00:04 GMT+0000 (UTC)

published: Wed Apr 26 2023 15:00:04 GMT+0000 (UTC)

arXiv
