Mixed Model OCR Training on Historical Latin Script for Out-of-the-Box Recognition and Finetuning

Christian Reul; Christoph Wick; Maximilian Nöth; Andreas Büttner; Maximilian Wehner; Uwe Springmann


To enable fully automatic Optical Character Recognition (OCR) of historical printings in Latin script, we report on our efforts to construct a widely applicable polyfont recognition model that yields text with a Character Error Rate (CER) of around 2% when applied out-of-the-box. Moreover, we show how this model can be further finetuned to specific classes of printings with little manual and computational effort. The mixed or polyfont model is trained on a wide variety of materials in terms of age (from the 15th to the 19th century), typography (various types of Fraktur and Antiqua), and language (among others, German, Latin, and French). To optimize the results, we combined established techniques of OCR training such as pretraining, data augmentation, and voting. In addition, we used various preprocessing methods to enrich the training data and obtain more robust models. We also implemented a two-stage approach which first trains on all available, considerably unbalanced data and then refines the output by training on a selected, more balanced subset. Evaluations on 29 previously unseen books resulted in a CER of 1.73%, compared to 2.84% for a widely used standard model, a relative error reduction of almost 40%. Training a more specialized model for some unseen Early Modern Latin books starting from our mixed model led to a CER of 1.47%, an improvement of up to 50% compared to training from scratch and up to 30% compared to training from the aforementioned standard model. Our new mixed model is made openly available to the community.
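The reported figures can be related to each other with a small sketch. It assumes the common definition of CER as the Levenshtein (edit) distance between recognized text and ground truth, divided by the ground-truth length; the abstract does not spell out the exact evaluation procedure, so the function names `edit_distance`, `cer`, and `relative_reduction` are illustrative and not taken from the paper or its tooling.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate, assumed here to be edit distance / reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which the improved error rate undercuts the baseline."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # CER values from the abstract: standard model 2.84%, mixed model 1.73%.
    print(f"{relative_reduction(2.84, 1.73):.1%}")  # prints 39.1%
```

With the CER values from the abstract, the relative reduction comes out at about 39%, which matches the "almost 40%" stated above.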

updated: Tue Jun 15 2021 04:51:54 GMT+0000 (UTC)

published: Tue Jun 15 2021 04:51:54 GMT+0000 (UTC)

arXiv
