Self-supervised Character-to-Character Distillation

Tongkun Guan; Wei Shen

When handling complicated text images (e.g., irregular structures, low resolution, heavy occlusion, and uneven illumination), existing supervised text recognition methods are data-hungry. Although these methods employ large-scale synthetic text images to reduce their dependence on annotated real images, the domain gap limits recognition performance. Learning robust text feature representations from unlabeled real images through self-supervised learning is therefore a promising solution. However, existing self-supervised text recognition methods perform only sequence-to-sequence representation learning, roughly splitting the visual features along the horizontal axis, which damages character structures. Moreover, these sequence-level methods restrict the use of geometric data augmentation, because large geometric transformations lead to sequence-to-sequence inconsistency. To address these issues, we propose CCD, a novel self-supervised character-to-character distillation method. Specifically, we delineate the character structures of unlabeled real images with a self-supervised character segmentation module, and then use the segmentation results to build character-level representation learning. CCD differs from prior work in that it uses a character-level pretext task to learn more fine-grained feature representations. In addition, in contrast to the inflexible augmentations of sequence-to-sequence models, CCD maintains character-to-character representation consistency across various transformations (e.g., geometry and colour), yielding robust text features in the representation space. Experiments demonstrate that CCD achieves state-of-the-art performance on publicly available text recognition benchmarks.
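
The contrast between sequence-level and character-level consistency can be illustrated with a toy NumPy sketch. This is not the authors' implementation. It assumes the character masks are already given (in CCD they would come from the self-supervised segmentation module), it assumes the features transform exactly with the image, it uses a horizontal flip as a stand-in for a generic geometric augmentation, and it assumes a temperature-scaled cross-entropy as the distillation loss. The names char_pool and distill_loss, and the temperature values, are hypothetical.

```python
import numpy as np

def char_pool(features, masks):
    """Masked average pooling: (H, W, C) features and (K, H, W) masks -> (K, C)."""
    masks = masks.astype(features.dtype)
    area = masks.sum(axis=(1, 2))                      # pixels per character, shape (K,)
    summed = np.einsum("khw,hwc->kc", masks, features)  # sum features inside each mask
    return summed / np.maximum(area, 1.0)[:, None]

def log_softmax(x, temperature):
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)               # subtract max for stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distill_loss(student, teacher, t_student=0.1, t_teacher=0.04):
    """Assumed loss: cross-entropy between sharpened teacher and student distributions."""
    p_teacher = np.exp(log_softmax(teacher, t_teacher))
    return float(-(p_teacher * log_softmax(student, t_student)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
H, W, C, K = 8, 24, 16, 3
features = rng.normal(size=(H, W, C))

# Hypothetical character masks: three rectangular regions, one per character.
masks = np.zeros((K, H, W), dtype=bool)
masks[0, 2:6, 1:6] = True
masks[1, 1:7, 9:15] = True
masks[2, 2:6, 17:23] = True

# Geometric augmentation (toy): flip features and masks together along the width.
features_flip = features[:, ::-1, :]
masks_flip = masks[:, :, ::-1]

# Character-level pooling follows the masks, so character k matches across views.
view_a = char_pool(features, masks)
view_b = char_pool(features_flip, masks_flip)

# Sequence-level pooling splits the width into K fixed strips, so strip k no
# longer covers the same content after the flip.
strips_a = features.reshape(H, K, W // K, C).mean(axis=(0, 2))
strips_b = features_flip.reshape(H, K, W // K, C).mean(axis=(0, 2))

loss = distill_loss(view_a, view_b)
print("character-level views match:", np.allclose(view_a, view_b))
print("sequence-level strips match:", np.allclose(strips_a, strips_b))
```

In this toy setting the per-character features agree across the two views because the masks are transformed along with the image, whereas the fixed horizontal strips pair up different content. A real encoder is not exactly equivariant, which is why a consistency loss between the two views is still needed during training.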

updated: Tue Nov 01 2022 05:48:18 GMT+0000 (UTC)

published: Tue Nov 01 2022 05:48:18 GMT+0000 (UTC)

arXiv
