uTHCD: A New Benchmarking for Tamil Handwritten OCR

Noushath Shaffi; Faizal Hajamohideen

Handwritten character recognition has remained a challenging research problem in document image analysis for decades, for reasons that include large variation in writing styles, inherent noise in the data, the breadth of applications it serves, and the lack of benchmark databases. Considerable work has been reported in the literature on creating databases for several Indic scripts, but work on the Tamil script is still in its infancy, with only one database reported [5]. In this paper, we present the work done in creating an exhaustive and large unconstrained Tamil Handwritten Character Database (uTHCD). The database consists of around 91,000 samples, with nearly 600 samples in each of 156 classes. It is a unified collection of both online and offline samples. Offline samples were collected by asking volunteers to write on a form inside a specified grid. For online samples, volunteers wrote in a similar grid using a digital writing pad. The collected samples encompass a wide variety of writing styles, along with distortions inherent to the offline scanning process, such as stroke discontinuity and variable stroke thickness. Algorithms that are resilient to such data can be practically deployed in real-time applications. The samples were collected from around 650 native Tamil volunteers, including schoolchildren, homemakers, university students, and faculty. The isolated character database will be made publicly available as raw images and as a compressed Hierarchical Data Format (HDF) file. With this database, we expect to set a new benchmark in Tamil handwritten character recognition and to provide a launchpad for many avenues of research in document image analysis. The paper also presents an ideal experimental set-up that uses the database with convolutional neural networks (CNNs) and reaches a baseline accuracy of 88% on test data.

updated: Sat Mar 13 2021 10:34:08 GMT+0000 (UTC)

published: Sat Mar 13 2021 10:34:08 GMT+0000 (UTC)

arXiv
