CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition

Tianlun Zheng; Zhineng Chen; Shancheng Fang; Hongtao Xie; Yu-Gang Jiang

The Transformer-based encoder-decoder framework is becoming popular in scene text recognition, largely because it naturally integrates recognition clues from both the visual and semantic domains. However, recent studies show that the two kinds of clues are not always well registered, so features and characters may be misaligned in difficult text (e.g., text with rare shapes). Constraints such as character position have therefore been introduced to alleviate this problem. Despite some success, a content-free positional embedding can hardly be associated stably with meaningful local image regions. In this paper, we propose a novel module called Multi-Domain Character Distance Perception (MDCDP) to establish a visually and semantically related positional encoding. MDCDP uses the positional embedding to query both visual and semantic features following the attention mechanism. The two kinds of constrained features are then fused to produce a reinforced feature, generating a content-aware embedding that perceives spacing variations and semantic affinities among characters well, i.e., multi-domain character distance. We develop a novel network named CDistNet that stacks multiple MDCDPs to guide gradually more precise distance modeling. Thus, the feature-character alignment is well built even when various recognition difficulties are present. We create two series of augmented datasets with increasing recognition difficulty and apply CDistNet to both these datasets and six public benchmarks. The experiments demonstrate that CDistNet outperforms recent popular methods by large margins in challenging recognition scenarios. It also achieves state-of-the-art accuracy on standard benchmarks. In addition, visualization shows that CDistNet achieves proper information utilization in both the visual and semantic domains. Our code is available at https://github.com/simplify23/CDistNet.
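The query-then-fuse idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`attend`, `fuse`, `mdcdp_step`), the sigmoid gate used for fusion, the toy dimensions, and the choice of two stacked steps are all assumptions made for illustration. It only shows positional embeddings acting as attention queries over visual and semantic features, followed by a fusion of the two results.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, features):
    """Scaled dot-product attention; `features` serve as both keys and values."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, f) / math.sqrt(dim) for f in features])
        attended.append(
            [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
        )
    return attended


def fuse(visual_rows, semantic_rows):
    """Hypothetical element-wise gate (an assumption, not taken from the paper)."""
    fused = []
    for v_row, s_row in zip(visual_rows, semantic_rows):
        row = []
        for v, s in zip(v_row, s_row):
            gate = 1.0 / (1.0 + math.exp(-(v + s)))
            row.append(gate * v + (1.0 - gate) * s)
        fused.append(row)
    return fused


def mdcdp_step(positions, visual, semantic):
    """Positions query each domain separately; the two results are then fused."""
    return fuse(attend(positions, visual), attend(positions, semantic))


def random_vectors(count, dim, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
dim = 4
positions = random_vectors(3, dim, rng)  # one embedding per character slot
visual = random_vectors(6, dim, rng)     # stand-in for image features
semantic = random_vectors(3, dim, rng)   # stand-in for character embeddings

embedding = positions
for _ in range(2):  # stacking depth chosen arbitrarily for this sketch
    embedding = mdcdp_step(embedding, visual, semantic)

print(len(embedding), len(embedding[0]))
```

Each pass replaces the content-free positional vectors with vectors built from the visual and semantic features they attended to, which is the sense in which the resulting embedding is content-aware. The actual architecture is in the linked repository.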

updated: Wed Jun 22 2022 00:21:12 GMT+0000 (UTC)

published: Mon Nov 22 2021 06:27:29 GMT+0000 (UTC)

arXiv
