CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition

Tianlun Zheng; Zhineng Chen; Shancheng Fang; Hongtao Xie; Yu-Gang Jiang

The attention-based encoder-decoder framework is becoming popular in scene text recognition, largely due to its superiority in integrating recognition clues from both the visual and semantic domains. However, recent studies show that the two clues might be misaligned in difficult text (e.g., text with rare shapes), and they introduce constraints such as character position to alleviate the problem. Despite some success, a content-free positional embedding can hardly be associated stably with meaningful local image regions. In this paper, we propose a novel module called Multi-Domain Character Distance Perception (MDCDP) to establish a visually and semantically related position encoding. MDCDP uses the positional embedding to query both visual and semantic features following the attention mechanism. It naturally encodes the positional clue, which describes both visual and semantic distances among characters. We develop a novel architecture named CDistNet that stacks MDCDP several times to guide precise distance modeling. Thus, the visual-semantic alignment is well built even when various difficulties are present. We apply CDistNet to two augmented datasets and six public benchmarks. The experiments demonstrate that CDistNet achieves state-of-the-art recognition accuracy. The visualization also shows that CDistNet achieves proper attention localization in both visual and semantic domains. We will release our code upon acceptance.
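The core idea in the abstract, a positional embedding used as the attention query over both visual and semantic features, with the module stacked several times, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the sinusoidal embedding, the additive fusion, and the absence of learned projections, normalization, and causal masking are all assumptions made only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_positions(length, dim):
    # Content-free sinusoidal positional embedding (assumed choice).
    pos = np.arange(length)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def cross_attention(query, keys, values):
    # Scaled dot-product attention without learned projections.
    scores = query @ keys.T / np.sqrt(query.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

def position_queries_both_domains(pos, visual, semantic):
    # The positional embedding acts as the query in both domains.
    vis_ctx, vis_w = cross_attention(pos, visual, visual)
    sem_ctx, sem_w = cross_attention(pos, semantic, semantic)
    # Hypothetical fusion: a plain residual sum of the two contexts.
    fused = pos + vis_ctx + sem_ctx
    return fused, vis_w, sem_w

def stack_position_queries(pos, visual, semantic, num_layers=3):
    # Apply the block several times, feeding the refined positional
    # representation back in as the next query.
    vis_w = sem_w = None
    for _ in range(num_layers):
        pos, vis_w, sem_w = position_queries_both_domains(pos, visual, semantic)
    return pos, vis_w, sem_w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(20, 16))   # 20 image regions, 16-dim features
    semantic = rng.normal(size=(5, 16))  # 5 character embeddings
    pos = sinusoidal_positions(5, 16)    # one query per character slot
    out, vis_w, sem_w = stack_position_queries(pos, visual, semantic)
    print(out.shape, vis_w.shape, sem_w.shape)
```

Each row of `vis_w` is a distribution over image regions for one character position, and each row of `sem_w` is a distribution over character embeddings, which is the kind of per-position attention localization the abstract refers to.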

updated: Mon Nov 22 2021 06:27:29 GMT+0000 (UTC)

published: Mon Nov 22 2021 06:27:29 GMT+0000 (UTC)

arXiv
