CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition

Tianlun Zheng; Zhineng Chen; Shancheng Fang; Hongtao Xie; Yu-Gang Jiang

The Transformer-based encoder-decoder framework is becoming popular in scene text recognition, largely because it naturally integrates recognition clues from both the visual and semantic domains. However, recent studies show that the two kinds of clues are not always well registered, so features and characters may be misaligned in difficult text (e.g., text with a rare shape). Constraints such as character position have been introduced to alleviate this problem. Despite some success, visual and semantic clues are still modeled separately and are only loosely associated. In this paper, we propose a novel module called Multi-Domain Character Distance Perception (MDCDP) to establish a visually and semantically related position embedding. MDCDP uses the position embedding to query both visual and semantic features through a cross-attention mechanism. The two kinds of clues are fused into the position branch, generating a content-aware embedding that perceives character spacing and orientation variations, semantic affinities between characters, and clues tying the two kinds of information together. We summarize these as the multi-domain character distance. We develop CDistNet, which stacks multiple MDCDPs to guide progressively more precise distance modeling. As a result, feature-character alignment is well established even when various recognition difficulties are present. We verify CDistNet on ten challenging public datasets and two series of augmented datasets that we created. The experiments demonstrate that CDistNet is highly competitive. It not only ranks in the top tier on standard benchmarks, but also outperforms recent popular methods by clear margins on real and augmented datasets exhibiting severe text deformation, poor linguistic support, and rare character layouts. Code is available at https://github.com/simplify23/CDistNet.
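To make the querying step in the abstract concrete, the following is a minimal NumPy sketch of a position embedding that attends over visual and semantic features and is then fused with the results. It is not the authors' implementation: the function names (`cross_attention`, `mdcdp_like_block`), the sinusoidal position embedding, the single attention head without learned projections, the averaging fusion with a residual connection, and the omission of causal masking on the semantic branch are all assumptions made for illustration. The released repository should be consulted for the actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, memory):
    # Single-head scaled dot-product attention without learned projections
    # (assumption). query: (T, d), memory: (N, d) -> context: (T, d).
    d = query.shape[-1]
    scores = query @ memory.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ memory, weights

def sinusoidal_positions(length, dim):
    # Standard sinusoidal position embedding, used here as a stand-in
    # for whatever position embedding the model actually uses.
    pos = np.arange(length)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def mdcdp_like_block(pos, visual, semantic):
    # Hypothetical block: the position embedding queries both domains,
    # and the two contexts are fused back into the position branch.
    vis_ctx, _ = cross_attention(pos, visual)
    sem_ctx, _ = cross_attention(pos, semantic)
    # Fusion by averaging plus a residual connection (assumption).
    return pos + 0.5 * (vis_ctx + sem_ctx)

rng = np.random.default_rng(0)
T, N, d = 5, 12, 16                    # characters, visual tokens, feature size
pos = sinusoidal_positions(T, d)
visual = rng.normal(size=(N, d))       # stand-in for encoder image features
semantic = rng.normal(size=(T, d))     # stand-in for character embeddings

# Stack several blocks so the embedding is refined step by step.
out = pos
for _ in range(3):
    out = mdcdp_like_block(out, visual, semantic)

print(out.shape)  # (5, 16)
```

The loop at the end mirrors the idea of stacking multiple modules: each pass produces a position embedding that carries more content from both domains than the previous one.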

updated: Fri Aug 11 2023 03:17:54 GMT+0000 (UTC)

published: Mon Nov 22 2021 06:27:29 GMT+0000 (UTC)

arXiv

