ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting

Shancheng Fang; Zhendong Mao; Hongtao Xie; Yuxin Wang; Chenggang Yan; Yongdong Zhang

Scene text spotting is of great importance to the computer vision community due to its wide variety of applications. Recent methods attempt to introduce linguistic knowledge for challenging recognition cases rather than relying on pure visual classification. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language models with noisy input. Correspondingly, we propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting. Firstly, the autonomous principle suggests enforcing language modeling explicitly by decoupling the recognizer into a vision model and a language model and blocking gradient flow between the two models. Secondly, a novel bidirectional cloze network (BCN) is proposed as the language model, based on bidirectional feature representation. Thirdly, we propose an execution manner of iterative correction for the language model, which can effectively alleviate the impact of noisy input. Finally, to polish ABINet++ for long text recognition, we propose to aggregate horizontal features by embedding Transformer units inside a U-Net, and design a position and content attention module that integrates character order and content to attend to character features precisely. ABINet++ achieves state-of-the-art performance on both scene text recognition and scene text spotting benchmarks, which consistently demonstrates the superiority of our method in various environments, especially on low-quality images. Besides, extensive experiments in both English and Chinese also prove that a text spotter incorporating our language modeling method can significantly improve its performance in both accuracy and speed compared with commonly used attention-based recognizers.
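As a rough illustration of the bidirectional cloze and iterative correction ideas named in the abstract, the toy Python sketch below may help. Nothing in it comes from the paper: the lexicon, the lookup-based cloze model, the equal-weight averaging used for fusion, and all function names are assumptions made for this example, whereas ABINet++ uses learned neural networks for each of these roles. The sketch only shows the control flow: a fixed vision estimate is repeatedly refined by a language model that fills in each masked character from context on both sides.

```python
# Toy sketch only: a lexicon lookup stands in for a learned language model.
LEXICON = ["house", "horse", "mouse"]  # hypothetical vocabulary


def argmax_text(probs):
    """Pick the most probable character at every position."""
    return "".join(max(p, key=p.get) for p in probs)


def cloze_predict(text, lexicon):
    """Mask each position in turn and predict it from left AND right context."""
    predictions = []
    for i in range(len(text)):
        counts = {}
        for word in lexicon:
            if len(word) != len(text):
                continue
            # A word supports position i if it matches every other position.
            if all(word[j] == text[j] for j in range(len(text)) if j != i):
                counts[word[i]] = counts.get(word[i], 0) + 1
        total = sum(counts.values())
        predictions.append({c: n / total for c, n in counts.items()} if total else {})
    return predictions


def fuse(vision, language, weight=0.5):
    """Average vision and language probabilities (an assumed fusion rule)."""
    fused = []
    for v, l in zip(vision, language):
        if not l:  # no language evidence: keep the vision estimate
            fused.append(dict(v))
            continue
        chars = set(v) | set(l)
        fused.append({c: (1 - weight) * v.get(c, 0.0) + weight * l.get(c, 0.0)
                      for c in chars})
    return fused


def iterative_correction(vision, lexicon, iterations=3):
    """Feed the current prediction back into the language model several times."""
    fused = vision
    for _ in range(iterations):
        text = argmax_text(fused)               # current, possibly noisy, reading
        language = cloze_predict(text, lexicon)
        fused = fuse(vision, language)          # vision estimate stays fixed
    return argmax_text(fused)


# Hypothetical vision output for an image of "house" with a noisy third character.
vision = [
    {"h": 0.9, "b": 0.1},
    {"o": 0.9, "c": 0.1},
    {"a": 0.45, "u": 0.40, "r": 0.15},
    {"s": 0.9, "5": 0.1},
    {"e": 0.9, "c": 0.1},
]
print(argmax_text(vision))                    # hoase
print(iterative_correction(vision, LEXICON))  # house
```

In this toy run the first pass can only correct the third character, because the other positions see a noisy context that matches no lexicon word; later passes then operate on the cleaner string. That is the intuition behind repeating the correction rather than applying the language model once.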

updated: Sat Nov 19 2022 03:50:33 GMT+0000 (UTC)

published: Sat Nov 19 2022 03:50:33 GMT+0000 (UTC)

arXiv
