Multi-Granularity Prediction with Learnable Fusion for Scene Text Recognition

Cheng Da; Peng Wang; Cong Yao

Owing to its enormous technical challenges and wide range of applications, scene text recognition (STR) has been an active research topic in computer vision for years. To tackle this difficult problem, numerous innovative methods have been proposed in succession, and incorporating linguistic knowledge into STR models has recently become a prominent trend. In this work, we first draw inspiration from recent progress in Vision Transformer (ViT) to construct a conceptually simple yet functionally powerful vision STR model, built upon ViT and a tailored Adaptive Addressing and Aggregation (A^3) module. This model already outperforms most previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods. To integrate linguistic knowledge, we further propose a Multi-Granularity Prediction strategy that injects information from the language modality into the model implicitly: subword representations widely used in NLP (BPE and WordPiece) are introduced into the output space alongside the conventional character-level representation, and no independent language model (LM) is adopted. To produce the final recognition results, two strategies for effectively fusing the multi-granularity predictions are devised. The resulting algorithm (termed MGP-STR) pushes the performance envelope of STR to an even higher level. Specifically, MGP-STR achieves an average recognition accuracy of 94% on standard benchmarks for scene text recognition. Moreover, it achieves state-of-the-art results on widely used handwritten benchmarks as well as on more challenging scene text datasets, demonstrating the generality of the proposed MGP-STR algorithm. The source code and models will be available at: https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR.
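The abstract does not spell out the two fusion strategies, so the following is only a minimal sketch of one plausible approach, assuming a confidence-based selection: each granularity (character, BPE, WordPiece) produces a candidate string with per-token probabilities, and the candidate with the highest sequence confidence is kept. The function names, the product-based score, and the example probabilities are all hypothetical and are not taken from the MGP-STR code.

```python
from math import prod


def sequence_confidence(token_probs):
    """Score a prediction as the product of its per-token probabilities.

    Assumption: a product-based score; the actual scoring rule may differ.
    """
    return prod(token_probs)


def fuse_predictions(candidates):
    """Pick the candidate with the highest sequence confidence.

    candidates maps a granularity name to a (text, per-token probabilities)
    pair. Returns the winning granularity name and its text.
    """
    best = max(
        candidates,
        key=lambda name: sequence_confidence(candidates[name][1]),
    )
    return best, candidates[best][0]


# Hypothetical outputs of the three prediction heads for a single image.
candidates = {
    "char": ("c0ffee", [0.99, 0.55, 0.98, 0.98, 0.99, 0.99]),
    "bpe": ("coffee", [0.93, 0.95]),
    "wordpiece": ("coffee", [0.90]),
}

granularity, text = fuse_predictions(candidates)
print(granularity, text)
```

In this made-up example the character head is unsure about one visually ambiguous glyph, so a subword head that scores the whole word more confidently wins. A product-based score tends to favor shorter token sequences, which is one reason a real system might prefer a different or learned fusion rule.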

updated: Tue Jul 25 2023 04:12:50 GMT+0000 (UTC)

published: Tue Jul 25 2023 04:12:50 GMT+0000 (UTC)

arXiv
