Scene Text Image Super-Resolution via Content Perceptual Loss and Criss-Cross Transformer Blocks

Rui Qin; Bin Wang; Yu-Wing Tai

Text image super-resolution is a distinct and important task that makes text images more readable to humans, and it is widely used as a pre-processing step in scene text recognition. However, because of the complex degradation that occurs in natural scenes, recovering high-resolution text from low-resolution input is ambiguous and challenging. Existing methods mainly rely on deep neural networks trained with pixel-wise losses designed for natural image reconstruction, which ignore the distinctive characteristics of text characters. A few works have proposed content-based losses, but they focus only on the accuracy of text recognizers, so the reconstructed images may still be ambiguous to humans. They also often generalize poorly across languages. To address these problems, we present TATSR, a Text-Aware Text Super-Resolution framework that learns the unique characteristics of text using Criss-Cross Transformer Blocks (CCTBs) and a novel Content Perceptual (CP) Loss. Each CCTB extracts vertical and horizontal content information from text images with two orthogonal transformers, one per direction. The CP Loss supervises text reconstruction with content semantics drawn from multi-scale text recognition features, which incorporates content awareness into the framework. Extensive experiments on datasets in several languages demonstrate that TATSR outperforms state-of-the-art methods in both recognition accuracy and human perception.
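
The two ideas named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not specify layer details, so everything below is an assumption made for illustration: the attention has no learned projections, the horizontal and vertical outputs are simply summed, and the loss is a plain mean absolute difference summed over scales. The function names `self_attention`, `criss_cross`, and `content_perceptual_loss` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Scaled dot-product self-attention over a list of vectors.

    Assumption: no learned query/key/value projections, so each
    vector serves as its own query, key, and value.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, seq)) for i in range(d)])
    return out


def criss_cross(feat):
    """Apply attention along rows and along columns of feat[h][w].

    Assumption: the two directional outputs are combined by addition.
    """
    height, width = len(feat), len(feat[0])
    # Horizontal pass: each row is a sequence of length `width`.
    horiz = [self_attention(row) for row in feat]
    # Vertical pass: each column is a sequence of length `height`.
    cols = [self_attention([feat[y][x] for y in range(height)]) for x in range(width)]
    return [
        [[a + b for a, b in zip(horiz[y][x], cols[x][y])] for x in range(width)]
        for y in range(height)
    ]


def content_perceptual_loss(sr_feats, hr_feats):
    """Sum over scales of the mean absolute feature difference.

    sr_feats / hr_feats: one flat list of floats per scale, standing in
    for features a text recognizer would produce for the super-resolved
    and the ground-truth high-resolution image (hypothetical inputs).
    """
    total = 0.0
    for sr, hr in zip(sr_feats, hr_feats):
        total += sum(abs(a - b) for a, b in zip(sr, hr)) / len(sr)
    return total
```

In this sketch, `criss_cross` takes a nested list `feat[h][w]` of feature vectors and returns a map of the same shape, while `content_perceptual_loss` returns zero when the two sets of recognizer features match at every scale and grows as they diverge.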

updated: Thu Oct 13 2022 11:48:45 GMT+0000 (UTC)

published: Thu Oct 13 2022 11:48:45 GMT+0000 (UTC)

arXiv
