Scene Text Synthesis for Efficient and Effective Deep Network Training

Changgong Zhang; Fangneng Zhan; Hongyuan Zhu; Shijian Lu

効率的かつ効果的なディープネットワークトレーニングのためのシーンテキスト合成

正確で堅牢なディープネットワークモデルのトレーニングには、大量の注釈付きトレーニング画像が不可欠ですが、大量の注釈付きトレーニング画像の収集には多くの場合、時間とコストがかかります。画像合成は、最近の深層学習研究でますます関心を集めている機械によって自動的に注釈付きトレーニング画像を生成することにより、この制約を緩和します。背景画像に前景オブジェクト (OOI) を現実的に埋め込むことにより、注釈付きトレーニング画像を構成する革新的な画像合成技術を開発します。提案された手法は、原則として、ディープネットワークトレーニングにおける合成画像の有用性を高める 2 つの重要なコンポーネントで構成されています。 1 つ目は、OOI が背景画像内の意味的に一貫した領域の周囲に配置されるようにする、コンテキスト認識型の意味的一貫性です。 2 つ目は、ジオメトリの配置と外観のリアリズムの両方から、埋め込まれた OOI が周囲の背景に調和するようにする、調和のとれた外観の適応です。提案された手法は、2 つの関連するが非常に異なるコンピュータービジョンの課題、すなわちシーンテキスト検出とシーンテキスト認識で評価されています。多数の公開データセットに対する実験は、提案された画像合成技術の有効性を示しています。ディープネットワークトレーニングで合成画像を使用すると、実際の画像を使用する場合と同様またはそれ以上のシーンテキスト検出およびシーンテキスト認識パフォーマンスを達成できます。

A large amount of annotated training images is critical for training accurate and robust deep network models but the collection of a large amount of annotated training images is often time-consuming and costly. Image synthesis alleviates this constraint by generating annotated training images automatically by machines which has attracted increasing interest in the recent deep learning research. We develop an innovative image synthesis technique that composes annotated training images by realistically embedding foreground objects of interest (OOI) into background images. The proposed technique consists of two key components that in principle boost the usefulness of the synthesized images in deep network training. The first is context-aware semantic coherence which ensures that the OOI are placed around semantically coherent regions within the background image. The second is harmonious appearance adaptation which ensures that the embedded OOI are agreeable to the surrounding background from both geometry alignment and appearance realism. The proposed technique has been evaluated over two related but very different computer vision challenges, namely, scene text detection and scene text recognition. Experiments over a number of public datasets demonstrate the effectiveness of our proposed image synthesis technique - the use of our synthesized images in deep network training is capable of achieving similar or even better scene text detection and scene text recognition performance as compared with using real images.

updated: Mon Apr 24 2023 12:35:49 GMT+0000 (UTC)

published: Sat Jan 26 2019 10:15:24 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト