Denoising and Segmentation of Epigraphical Scripts

P Preethi; Hrishikesh Viswanath

This paper presents a new method for denoising images using Haralick features and then segmenting the characters using artificial neural networks. The image is divided into kernels, each of which is converted to a GLCM (Gray Level Co-Occurrence Matrix). A Haralick feature generation function is called on each GLCM, yielding an array of fourteen elements corresponding to the fourteen features. The Haralick values and the corresponding noise/text classifications form a dictionary, which is then used to denoise the image through kernel comparison. Segmentation is the process of extracting characters from a document and can be used when letters are separated by white space, which is an explicit boundary marker. Segmentation is the first step in many natural language processing problems. This paper explores the process of segmentation using neural networks. While there are numerous methods for segmenting the characters of a document, this paper is concerned only with the accuracy of doing so using neural networks. It is imperative that the characters be segmented correctly, since failing to do so will lead to incorrect recognition by natural language processing tools. Artificial neural networks were used to attain an accuracy of up to 89%. This method is suitable for languages in which the characters are delimited by white space. However, it will fail to provide acceptable results when the language makes heavy use of connected letters. An example is the Devanagari script, which is predominantly used in northern India.
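The kernel-to-GLCM-to-features pipeline described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it computes only three of the fourteen Haralick features (angular second moment, contrast, inverse difference moment), uses a single horizontal offset, and assumes that "kernel comparison" means a nearest-neighbour lookup in the feature dictionary. The function names `glcm`, `haralick_subset`, and `classify`, the toy kernels, and the labels attached to them are all hypothetical.

```python
from math import dist

def glcm(kernel, levels, dx=1, dy=0):
    """Normalised, symmetric gray-level co-occurrence matrix for one offset."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(kernel), len(kernel[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = kernel[r][c], kernel[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1  # count each pair in both directions
    total = sum(map(sum, counts)) or 1  # avoid dividing by zero on 1x1 kernels
    return [[v / total for v in row] for row in counts]

def haralick_subset(p):
    """Three of the fourteen Haralick features (assumption: a subset is
    enough to illustrate the idea; the paper uses all fourteen)."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    asm = sum(v * v for _, _, v in cells)                    # angular second moment
    contrast = sum((i - j) ** 2 * v for i, j, v in cells)    # contrast
    idm = sum(v / (1 + (i - j) ** 2) for i, j, v in cells)   # inverse difference moment
    return (asm, contrast, idm)

def classify(kernel, dictionary, levels):
    """Return the label of the dictionary entry nearest in feature space.
    Assumption: nearest-neighbour matching stands in for 'kernel comparison'."""
    feats = haralick_subset(glcm(kernel, levels))
    return min(dictionary, key=lambda entry: dist(entry[0], feats))[1]

# Toy binary kernels; the labels are made up for illustration only.
flat = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
speckle = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
dictionary = [
    (haralick_subset(glcm(flat, 2)), "text"),
    (haralick_subset(glcm(speckle, 2)), "noise"),
]

query = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]
label = classify(query, dictionary, 2)
print(label)
```

In this toy setup the uniform kernel has maximal angular second moment and zero contrast, while the checkerboard kernel has high contrast, so a mostly uniform query kernel falls closer to the first dictionary entry. A practical version would average the GLCM over several offsets and use real labelled kernels.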

updated: Sun Jul 25 2021 13:25:08 GMT+0000 (UTC)

published: Sun Jul 25 2021 13:25:08 GMT+0000 (UTC)

arXiv
