Exploring Stroke-Level Modifications for Scene Text Editing

Yadong Qu; Qingfeng Tan; Hongtao Xie; Jianjun Xu; Yuxin Wang; Yongdong Zhang

Scene text editing (STE) aims to replace the text in an image with a desired text while preserving the background and style of the original text. However, because of complicated background textures and varied text styles, existing methods fall short of generating clear and legible edited text images. In this study, we attribute the poor editing performance to two problems: 1) Implicit decoupling structure. Previous methods edit the whole image and therefore have to learn different translation rules for background and text regions simultaneously. 2) Domain gap. Because edited real scene text images are lacking, the network can only be well trained on synthetic pairs and performs poorly on real-world images. To handle these problems, we propose a novel network that works by MOdifying Scene Text image at strokE Level (MOSTEL). First, we generate stroke guidance maps to explicitly indicate the regions to be edited. Unlike the implicit guidance that comes from directly modifying all pixels at the image level, such explicit instructions filter out distractions from the background and guide the network to focus on the editing rules of text regions. Second, we propose Semi-supervised Hybrid Learning, which trains the network with both labeled synthetic images and unpaired real scene text images, so that the STE model adapts to the distribution of real-world datasets. Moreover, two new datasets (Tamper-Syn2k and Tamper-Scene) are proposed to fill the gap in public evaluation datasets. Extensive experiments demonstrate that MOSTEL outperforms previous methods both qualitatively and quantitatively. Datasets and code will be available at https://github.com/qqqyd/MOSTEL.
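
To make the idea of explicit stroke-level guidance concrete, the following is a minimal sketch, not the MOSTEL implementation. The abstract does not say how the stroke guidance map is consumed by the network, so the sketch assumes the simplest possible use: a soft mask in [0, 1] that blends edited pixels into the source image only where strokes are indicated. The function name `composite_with_stroke_mask` and its parameters are hypothetical.

```python
import numpy as np


def composite_with_stroke_mask(source, edited, stroke_mask):
    """Blend `edited` into `source` only where `stroke_mask` is active.

    Hypothetical helper for illustration; the blending rule is an
    assumption, not the method described in the paper.

    source, edited: arrays of shape (H, W, C)
    stroke_mask:    array of shape (H, W) or (H, W, 1), values in [0, 1]
    """
    source = np.asarray(source, dtype=np.float32)
    edited = np.asarray(edited, dtype=np.float32)
    # Clamp the mask so it can be used as a per-pixel blending weight.
    mask = np.clip(np.asarray(stroke_mask, dtype=np.float32), 0.0, 1.0)
    if mask.ndim == source.ndim - 1:
        mask = mask[..., None]  # add a channel axis so it broadcasts over C
    # Stroke pixels come from the edited image, all others from the source.
    return mask * edited + (1.0 - mask) * source


if __name__ == "__main__":
    src = np.zeros((4, 4, 3), dtype=np.float32)  # toy "background"
    edt = np.ones((4, 4, 3), dtype=np.float32)   # toy "edited text" image
    m = np.zeros((4, 4), dtype=np.float32)
    m[1, 2] = 1.0                                # a single stroke pixel
    out = composite_with_stroke_mask(src, edt, m)
    print(out[1, 2], out[0, 0])
```

In this toy setting, pixels outside the mask are returned unchanged, which is the sense in which an explicit map keeps the background out of the editing problem.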

updated: Mon Dec 05 2022 02:10:59 GMT+0000 (UTC)

published: Mon Dec 05 2022 02:10:59 GMT+0000 (UTC)

arXiv
