Separating Content from Style Using Adversarial Learning for Recognizing Text in the Wild

Canjie Luo; Qingxiang Lin; Yuliang Liu; Lianwen Jin; Chunhua Shen

We propose to improve text recognition from a new perspective by separating the text content from complex backgrounds. As vanilla GANs are not sufficiently robust to generate sequence-like characters in natural images, we propose an adversarial learning framework for the generation and recognition of multiple characters in an image. The proposed framework consists of an attention-based recognizer and a generative adversarial architecture. Furthermore, to tackle the issue of lacking paired training samples, we design an interactive joint training scheme, which shares attention masks from the recognizer to the discriminator, and enables the discriminator to extract the features of each character for further adversarial training. Benefiting from the character-level adversarial training, our framework requires only unpaired simple data for style supervision. Each target style sample containing only one randomly chosen character can be simply synthesized online during the training. This is significant as the training does not require costly paired samples or character-level annotations. Thus, only the input images and corresponding text labels are needed. In addition to the style normalization of the backgrounds, we refine character patterns to ease the recognition task. A feedback mechanism is proposed to bridge the gap between the discriminator and the recognizer. Therefore, the discriminator can guide the generator according to the confusion of the recognizer, so that the generated patterns are clearer for recognition. Experiments on various benchmarks, including both regular and irregular text, demonstrate that our method significantly reduces the difficulty of recognition. Our framework can be integrated into recent recognition methods to achieve new state-of-the-art recognition accuracy.
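The attention-mask sharing described above can be illustrated with a minimal sketch. The abstract does not specify the exact operation, so the version below assumes the simplest reading: each recognizer attention mask is used as a set of weights for pooling the discriminator's feature map into one feature vector per character. The function name `pool_character_features`, the `[H][W][C]` layout, and the mask normalization are all hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # assumed layout: [height][width][channels]
AttentionMask = List[List[float]]     # assumed layout: [height][width]


def pool_character_features(
    feature_map: FeatureMap,
    attention_masks: List[AttentionMask],
) -> List[List[float]]:
    """Return one feature vector per attention mask (i.e. per decoded character).

    Each vector is the attention-weighted average of the feature map, so a
    discriminator could judge the style of each character separately.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])

    pooled = []
    for mask in attention_masks:
        total = sum(sum(row) for row in mask)
        if total <= 0:
            raise ValueError("attention mask must have positive total weight")
        vector = [0.0] * channels
        for y in range(height):
            for x in range(width):
                weight = mask[y][x] / total  # normalize so the weights sum to 1
                for c in range(channels):
                    vector[c] += weight * feature_map[y][x][c]
        pooled.append(vector)
    return pooled


# Toy example: a 1x4 feature map with 2 channels and two character masks.
features = [[[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]]
masks = [
    [[0.5, 0.5, 0.0, 0.0]],  # first character attends to the left half
    [[0.0, 0.0, 1.0, 3.0]],  # second character attends to the right half
]
character_features = pool_character_features(features, masks)
```

Under this reading, character-level annotations are unnecessary because the masks come from the recognizer, which is trained only on images and text labels. Each pooled vector can then be compared against the features of a single-character target style sample.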

updated: Sat Dec 12 2020 08:11:22 GMT+0000 (UTC)

published: Mon Jan 13 2020 12:41:42 GMT+0000 (UTC)

arXiv
