A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution

Jianqi Ma; Zhetong Liang; Lei Zhang

Scene text image super-resolution aims to increase the resolution and readability of the text in low-resolution images. Though deep convolutional neural networks (CNNs) have brought significant improvement, it remains difficult to reconstruct high-resolution images of spatially deformed text, especially rotated and curve-shaped text. This is because current CNN-based methods adopt locality-based operations, which are not effective at handling the variation caused by deformations. In this paper, we propose a CNN-based Text ATTention network (TATT) to address this problem. The semantics of the text are first extracted by a text recognition module as text prior information. We then design a novel transformer-based module, which leverages a global attention mechanism, to exert the semantic guidance of the text prior on the text reconstruction process. In addition, we propose a text structure consistency loss that refines the visual appearance by imposing structural consistency on the reconstructions of regular and deformed texts. Experiments on the benchmark TextZoom dataset show that the proposed TATT not only achieves state-of-the-art performance in terms of PSNR/SSIM metrics, but also significantly improves recognition accuracy in the downstream text recognition task, particularly for text instances with multiple orientations and curved shapes. Code is available at https://github.com/mjq11302010044/TATT.
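
To make the idea of global attention between image features and a text prior concrete, here is a minimal sketch of plain scaled dot-product cross-attention in pure Python. It is an illustration under stated assumptions, not the TATT module itself: the function names, the toy 2-dimensional features, and the absence of learned projection layers are all hypothetical choices made for this example. Each image position acts as a query and attends over every step of the text prior, so the guidance does not depend on spatial locality.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: N x d list of image feature vectors (one per position).
    keys, values: L x d lists derived from the text prior (one per step).
    Returns the N x d attended features and the N x L attention weights.
    Assumption: no learned projections, unlike a real transformer layer.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended, weights = [], []
    for q in queries:
        # Similarity of this image position to every text-prior step.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of the text-prior values.
        out = [sum(w[j] * values[j][c] for j in range(len(values)))
               for c in range(len(values[0]))]
        attended.append(out)
        weights.append(w)
    return attended, weights


# Toy data (hypothetical): 3 image positions, 2 text-prior steps.
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_prior = [[4.0, 0.0], [0.0, 4.0]]
attended, weights = cross_attention(image_feats, text_prior, text_prior)
```

In this toy setup the first image position aligns with the first text-prior step and puts most of its attention weight there, while the third position is equally similar to both steps and splits its weight evenly. Because every position scores against every step, where the text sits in the image does not restrict which part of the prior it can draw on, which is the property the abstract contrasts with locality-based operations.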

updated: Thu Mar 17 2022 15:28:29 GMT+0000 (UTC)

published: Thu Mar 17 2022 15:28:29 GMT+0000 (UTC)

arXiv
