Self-supervised Character-to-Character Distillation for Text Recognition

Tongkun Guan; Wei Shen; Xue Yang; Qi Feng; Zekun Jiang

When handling complicated text images (e.g., irregular structures, low resolution, heavy occlusion, and uneven illumination), existing supervised text recognition methods are data-hungry. Although these methods employ large-scale synthetic text images to reduce the dependence on annotated real images, the domain gap still limits recognition performance. Exploring robust text feature representations on unlabeled real images through self-supervised learning is therefore a promising solution. However, existing self-supervised text recognition methods conduct sequence-to-sequence representation learning by roughly splitting the visual features along the horizontal axis. This limits the flexibility of the augmentations, because large geometric augmentations may lead to sequence-to-sequence feature inconsistency. Motivated by this, we propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate general text representation learning. Specifically, we delineate the character structures of unlabeled real images by designing a self-supervised character segmentation module. Following this, CCD enriches the diversity of local characters while keeping their pairwise alignment under flexible augmentations, using the transformation matrix between two augmented views of an image. Experiments demonstrate that CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, and 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution. Code will be released soon.
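
The abstract's idea of keeping characters aligned through "the transformation matrix between two augmented views" can be illustrated with a small sketch. This is not the authors' code: it assumes each augmentation is a known affine transform of the original image, and the helper names (`affine`, `map_points`) and the character coordinates are hypothetical. Under that assumption, the matrix that carries view A onto view B is the second transform composed with the inverse of the first.

```python
import numpy as np

def affine(scale=1.0, angle_deg=0.0, tx=0.0, ty=0.0):
    """Build a 3x3 homogeneous matrix: rotate and scale about the origin, then translate."""
    a = np.deg2rad(angle_deg)
    c, s = np.cos(a) * scale, np.sin(a) * scale
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

def map_points(points_xy, matrix):
    """Apply a 3x3 homogeneous transform to an (N, 2) array of (x, y) points."""
    homog = np.hstack([points_xy, np.ones((len(points_xy), 1))])
    mapped = homog @ matrix.T
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical character centers in the original image (illustrative values only).
chars = np.array([[10.0, 16.0], [30.0, 16.0], [50.0, 16.0]])

# Two independent geometric augmentations of the same image (assumed affine).
t_a = affine(scale=1.2, angle_deg=15.0, tx=4.0, ty=-2.0)
t_b = affine(scale=0.8, angle_deg=-20.0, tx=-3.0, ty=5.0)

# Matrix taking coordinates in view A to coordinates in view B.
a_to_b = t_b @ np.linalg.inv(t_a)

chars_in_a = map_points(chars, t_a)
chars_in_b = map_points(chars_in_a, a_to_b)
```

Because the mapping is computed from the augmentation parameters rather than from a fixed horizontal split, each character region in one view can be paired with its counterpart in the other even under rotation or scaling, which is the property the abstract attributes to character-level alignment.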

updated: Wed Mar 22 2023 07:03:38 GMT+0000 (UTC)

published: Tue Nov 01 2022 05:48:18 GMT+0000 (UTC)

arXiv
