On-Device Spatial Attention based Sequence Learning Approach for Scene Text Script Identification

Rutika Moharir; Arun D Prabhu; Sukumar Moharana; Gopi Ramena; Rachit S Munjal

Automatic script identification is an essential component of a multilingual OCR engine. In this paper, we present an efficient, lightweight, real-time, on-device spatial attention based CNN-LSTM network for scene text script identification, suitable for deployment on resource-constrained mobile devices. Our network consists of a CNN equipped with a spatial attention module, which helps reduce the spatial distortions present in natural images. This allows the feature extractor to generate rich image representations while ignoring the deformations, thereby enhancing performance on this fine-grained classification task. The network also employs residual convolutional blocks to build a deep network that focuses on the discriminative features of a script. The CNN learns the text feature representation by identifying each character as belonging to a particular script, and the long-term spatial dependencies within the text are captured using the sequence learning capabilities of the LSTM layers. By combining the spatial attention mechanism with the residual convolutional blocks, we enhance the performance of the baseline CNN and build an end-to-end trainable network for script identification. Experimental results on several standard benchmarks demonstrate the effectiveness of our method. The network achieves accuracy competitive with state-of-the-art methods and is superior in terms of network size, with a total of just 1.1 million parameters and an inference time of 2.7 milliseconds.
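
To make the parameter figure more concrete, the following is a minimal sketch of how the parameter count of a CNN-LSTM stack of this kind could be tallied and checked against the 1.1 million total quoted above. Every layer size, the 7x7 two-channel attention convolution, the choice of 10 output scripts, and all function names are assumptions made for illustration; the abstract does not specify the layer configuration, so this is not the authors' architecture. The LSTM count assumes one bias vector per gate (some frameworks store two).

```python
# Hypothetical parameter tally for a CNN-LSTM script classifier.
# All layer sizes are illustrative assumptions, not values from the paper.

def conv2d_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return k * k * c_in * c_out + c_out

def residual_block_params(channels, k=3):
    """Two same-width convolutions; the identity skip adds no parameters."""
    return 2 * conv2d_params(channels, channels, k)

def lstm_params(input_size, hidden_size):
    """Four gates, each with input weights, recurrent weights and one bias."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def dense_params(n_in, n_out):
    """Fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

# (layer name, parameter count) for an assumed configuration.
LAYERS = [
    ("conv stem 1->32", conv2d_params(1, 32, 3)),
    ("residual block @32", residual_block_params(32)),
    ("conv 32->64", conv2d_params(32, 64, 3)),
    ("residual block @64", residual_block_params(64)),
    ("spatial attention (assumed 7x7 conv, 2->1)", conv2d_params(2, 1, 7)),
    ("LSTM 64->128", lstm_params(64, 128)),
    ("LSTM 128->128", lstm_params(128, 128)),
    ("classifier 128->10 scripts (assumed)", dense_params(128, 10)),
]

BUDGET = 1_100_000  # total parameter count quoted in the abstract

total = sum(count for _, count in LAYERS)

if __name__ == "__main__":
    for name, count in LAYERS:
        print(f"{name:45s} {count:>9,d}")
    print(f"{'total':45s} {total:>9,d}")
    print("within budget:", total <= BUDGET)
```

In this assumed configuration the recurrent layers account for most of the parameters, which suggests that the LSTM hidden size is the main lever when fitting such a model to an on-device budget.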

updated: Wed Dec 01 2021 12:16:02 GMT+0000 (UTC)

published: Wed Dec 01 2021 12:16:02 GMT+0000 (UTC)

arXiv
