Digitizing Historical Balance Sheet Data: A Practitioner's Guide

Sergio Correia; Stephan Luck

This paper discusses how to successfully digitize large-scale historical micro-data by augmenting optical character recognition (OCR) engines with pre- and post-processing methods. Although OCR software has improved dramatically in recent years thanks to advances in machine learning, off-the-shelf OCR applications still exhibit high error rates, which limits their usefulness for accurately extracting structured information. Complementing OCR with additional methods can, however, dramatically increase its success rate, making it a powerful and cost-efficient tool for economic historians. This paper showcases these methods and explains why they are useful. We apply them to two large balance sheet datasets and introduce quipucamayoc, a Python package that combines these methods in a unified framework.

updated: Sun Jul 24 2022 18:43:42 GMT+0000 (UTC)

published: Thu Mar 31 2022 19:18:38 GMT+0000 (UTC)

arXiv
