Levenshtein OCR

Cheng Da; Peng Wang; Cong Yao

A novel scene text recognizer based on Vision-Language Transformer (VLT) is presented. Inspired by the Levenshtein Transformer from NLP, the proposed method (named Levenshtein OCR, or LevOCR for short) explores an alternative way to automatically transcribe textual content from cropped natural images. Specifically, we cast scene text recognition as an iterative sequence refinement process. The initial prediction sequence produced by a pure vision model is encoded and fed into a cross-modal transformer, where it interacts and fuses with the visual features to progressively approximate the ground truth. The refinement process is accomplished via two basic character-level operations, deletion and insertion, which are learned with imitation learning and allow for parallel decoding, dynamic length change and good interpretability. Quantitative experiments demonstrate that LevOCR achieves state-of-the-art performance on standard benchmarks, and qualitative analyses verify the effectiveness and advantages of the proposed algorithm. Code will be released soon.
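To make the deletion-and-insertion refinement concrete, the following is a minimal sketch of one refinement round driven by an oracle that can see the ground truth. It is not the LevOCR model: the abstract states that the two operations are learned with imitation learning, and this sketch only illustrates the kind of expert edits such training could imitate. The longest-common-subsequence oracle, the function names `lcs_keep_mask` and `oracle_refine`, and the example strings are all assumptions introduced here for illustration.

```python
from typing import List, Tuple


def lcs_keep_mask(pred: str, target: str) -> List[bool]:
    """Mark the characters of pred that lie on one longest common subsequence with target."""
    n, m = len(pred), len(target)
    # dp[i][j] = length of the LCS of pred[i:] and target[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if pred[i] == target[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    keep = [False] * n
    i = j = 0
    while i < n and j < m:
        if pred[i] == target[j]:
            keep[i] = True  # character is kept
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1  # pred[i] is not on the LCS, so it will be deleted
        else:
            j += 1  # target[j] is missing from pred, so it will be inserted later
    return keep


def oracle_refine(pred: str, target: str) -> Tuple[str, List[int], List[str]]:
    """One oracle round (hypothetical helper): delete wrong characters, then plan insertions.

    Returns the sequence after deletion, the number of characters to insert
    in each gap (before position 0, between characters, after the end), and
    the characters that fill those gaps from left to right.
    """
    keep = lcs_keep_mask(pred, target)
    after_delete = "".join(c for c, k in zip(pred, keep) if k)
    slots = [0] * (len(after_delete) + 1)
    inserted: List[str] = []
    k = 0  # index of the next kept character still to be matched
    for ch in target:
        if k < len(after_delete) and after_delete[k] == ch:
            k += 1
        else:
            slots[k] += 1
            inserted.append(ch)
    return after_delete, slots, inserted


# Hypothetical example: the vision model misreads "house" as "hcuse".
after_delete, slots, inserted = oracle_refine("hcuse", "house")
print(after_delete, slots, inserted)
```

Each decision in this sketch (keep or delete a character, how many characters to add in a gap, which characters to add) is made per position, which is what permits parallel decoding and a changing sequence length. The explicit edit trace is also what makes the process easy to inspect.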

updated: Thu Sep 08 2022 06:46:50 GMT+0000 (UTC)

published: Thu Sep 08 2022 06:46:50 GMT+0000 (UTC)

arXiv
