Robust Text Line Detection in Historical Documents: Learning and Evaluation Methods

Mélodie Boillet; Christopher Kermorvant; Thierry Paquet

Text line segmentation is one of the key steps in historical document understanding. It is challenging because of the variety of fonts, contents and writing styles, and because document quality has degraded over the years. In this paper, we address the limitations that currently prevent building line segmentation models with a high generalization capacity. We present a study conducted with three state-of-the-art systems, Doc-UFCN, dhSegment and ARU-Net, and show that it is possible to build generic models, trained on a wide variety of historical document datasets, that correctly segment diverse unseen pages. This paper also highlights the importance of the annotations used during training: each existing dataset is annotated differently. We present a unification of the annotations and show its positive impact on the final text recognition results. To this end, we present a complete evaluation strategy that uses standard pixel-level metrics and object-level metrics, and that introduces goal-oriented metrics.
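To make the notion of a pixel-level metric concrete, the sketch below computes Intersection-over-Union (IoU) between a predicted and a ground-truth binary line mask. This is only an illustration: the abstract does not say which pixel-level metrics are used, so the choice of IoU, the function name pixel_iou and the toy masks are all assumptions.

```python
def pixel_iou(pred, truth):
    """Intersection-over-Union between two binary masks of equal shape.

    Masks are lists of rows; any truthy value marks a text-line pixel.
    Hypothetical helper for illustration, not taken from the paper.
    """
    if len(pred) != len(truth) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, truth)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            p, t = bool(p), bool(t)
            intersection += p and t  # pixel labelled as text in both masks
            union += p or t          # pixel labelled as text in either mask

    # Two empty masks agree perfectly; treat that case as IoU = 1.
    return intersection / union if union else 1.0


# Toy 3x4 page: one ground-truth line, and a prediction that misses
# one pixel of it and adds one spurious pixel below.
truth = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
pred = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
]
score = pixel_iou(pred, truth)  # 3 shared pixels out of 5 in the union
```

A score like this only measures pixel overlap. It does not say whether individual lines were found or whether the detected lines are usable by a text recognizer, which is why the abstract also calls for object-level and goal-oriented metrics.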

updated: Fri Oct 21 2022 08:29:06 GMT+0000 (UTC)

published: Wed Mar 23 2022 11:56:25 GMT+0000 (UTC)

arXiv
