Traditional Chinese Synthetic Datasets Verified with Labeled Data for Scene Text Recognition

Yi-Chang Chen; Yu-Chuan Chang; Yen-Cheng Chang; Yi-Ren Yeh

Scene text recognition (STR) has been widely studied in academia and industry. Training a text recognition model often requires a large amount of labeled data, but data labeling can be difficult, expensive, or time-consuming, especially for Traditional Chinese text recognition. To the best of our knowledge, public datasets for Traditional Chinese text recognition are lacking. This paper presents a framework for a Traditional Chinese synthetic data engine that aims to improve text recognition model performance. We generated over 20 million synthetic samples and collected over 7,000 manually labeled samples, TC-STR 7k-word, as a benchmark. Experimental results show that a text recognition model can achieve much better accuracy either by training from scratch with our generated synthetic data or by further fine-tuning with TC-STR 7k-word.

updated: Fri Nov 26 2021 06:27:06 GMT+0000 (UTC)

published: Fri Nov 26 2021 06:27:06 GMT+0000 (UTC)

arXiv
