Weakly Supervised Scene Text Generation for Low-resource Languages

Yangchen Xie; Xinyuan Chen; Hongjian Zhan; Palaiahankote Shivakum; Bing Yin; Cong Liu; Yue Lu

A large number of annotated training images is crucial for training successful scene text recognition models. However, collecting sufficient datasets can be a labor-intensive and costly process, particularly for low-resource languages. Automatically generating text data has shown promise in alleviating this problem. Unfortunately, existing scene text generation methods typically rely on a large amount of paired data, which is difficult to obtain for low-resource languages. In this paper, we propose a novel weakly supervised scene text generation method that leverages a few recognition-level labels as weak supervision. The proposed method can generate a large number of scene text images with diverse backgrounds and font styles through cross-language generation. Our method disentangles the content and style features of scene text images, with the former representing textual information and the latter representing characteristics such as font, alignment, and background. To preserve the complete content structure of generated images, we introduce an integrated attention module. Furthermore, to bridge the style gap between different languages, we incorporate a pre-trained font classifier. We evaluate our method using state-of-the-art scene text recognition models. Experiments demonstrate that our generated scene text significantly improves scene text recognition accuracy and helps achieve higher accuracy when used to complement other generation methods.

updated: Tue Jun 27 2023 15:34:17 GMT+0000 (UTC)

published: Sun Jun 25 2023 15:26:06 GMT+0000 (UTC)

arXiv
