Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition

Ayan Kumar Bhunia; Aneeshan Sain; Amandeep Kumar; Shuvozit Ghose; Pinaki Nath Chowdhury; Yi-Zhe Song

Although text recognition has advanced significantly over the years, state-of-the-art (SOTA) models still struggle in in-the-wild scenarios because of complex backgrounds, varying fonts, uncontrolled illumination, distortions and other artefacts. This is because such models depend solely on visual information for text recognition and therefore lack semantic reasoning capabilities. In this paper, we argue that semantic information plays a role complementary to visual information. More specifically, we exploit semantic information by proposing a multi-stage, multi-scale attentional decoder that performs joint visual-semantic reasoning. Our novelty lies in the intuition that, for text recognition, the prediction should be refined in a stage-wise manner. Our key contribution is therefore the design of a stage-wise unrolling attentional decoder, in which the non-differentiability introduced by discretely predicted character labels must be bypassed to allow end-to-end training. While the first stage predicts using visual features, subsequent stages refine that prediction using joint visual-semantic information. Additionally, we introduce multi-scale 2D attention, along with dense and residual connections between stages, to handle characters of varying sizes, improve performance and speed up convergence during training. Experimental results show that our approach outperforms existing SOTA methods by a considerable margin.
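The abstract does not say how the non-differentiability of discrete character labels is bypassed. One common option, assumed here purely for illustration, is a Gumbel-softmax relaxation: instead of passing a hard argmax label to the next stage, the decoder passes a soft, near-one-hot probability vector and uses it to mix character embeddings. The sketch below uses only the Python standard library; the function names, the toy logits and the embedding table are hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed (soft) one-hot sample from categorical logits.

    Assumption: this stands in for whatever relaxation the authors use.
    """
    noisy = []
    for x in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = max(rng.random(), 1e-12)  # guard against log(0)
        noisy.append(x - math.log(-math.log(u)))
    return softmax(noisy, temperature)


def soft_embedding(probs, embedding_table):
    """Probability-weighted mix of character embeddings.

    With a one-hot `probs` this reduces to an ordinary embedding lookup,
    but it stays a smooth function of `probs` otherwise.
    """
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


# Toy example: stage-1 logits over a 3-character alphabet (hypothetical values).
rng = random.Random(0)
stage1_logits = [2.0, 0.5, -1.0]
char_embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]

soft_label = gumbel_softmax(stage1_logits, temperature=0.5, rng=rng)
semantic_feature = soft_embedding(soft_label, char_embeddings)
# `semantic_feature` is what a later stage could combine with visual features.
```

In an autodiff framework, the same idea lets gradients from later stages flow back through the soft label into the first stage; lowering the temperature pushes the soft label closer to a hard one-hot choice.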

updated: Mon Jul 26 2021 10:15:14 GMT+0000 (UTC)

published: Mon Jul 26 2021 10:15:14 GMT+0000 (UTC)

arXiv
