Multimodal Semi-Supervised Learning for Text Recognition

Aviad Aberdam; Roy Ganz; Shai Mazor; Ron Litman

Until recently, the number of public real-world text images was insufficient for training scene text recognizers. Therefore, most modern training methods rely on synthetic data and operate in a fully supervised manner. Nevertheless, the amount of public real-world text images has increased significantly of late, including a great deal of unlabeled data. Leveraging these resources requires semi-supervised approaches; however, the few existing methods do not account for the vision-language multimodal structure and are therefore suboptimal for state-of-the-art multimodal architectures. To bridge this gap, we present semi-supervised learning for multimodal text recognizers (SemiMTR), which leverages unlabeled data at each modality training phase. Notably, our method refrains from extra training stages and maintains the current three-stage multimodal training procedure. Our algorithm starts by pretraining the vision model through a single-stage training that unifies self-supervised learning with supervised training. More specifically, we extend an existing visual representation learning algorithm and propose the first contrastive-based method for scene text recognition. After pretraining the language model on a text corpus, we fine-tune the entire network via sequential, character-level consistency regularization between weakly and strongly augmented views of text images. In a novel setup, consistency is enforced on each modality separately. Extensive experiments validate that our method outperforms the current training schemes and achieves state-of-the-art results on multiple scene text recognition benchmarks.
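
The character-level consistency idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the loss form, so the example assumes a FixMatch-style variant in which the weak view supplies hard pseudo-labels, the strong view is penalized with cross-entropy, and low-confidence positions are skipped. The function name `char_consistency_loss` and the `threshold` parameter are hypothetical.

```python
import math


def char_consistency_loss(weak_probs, strong_probs, threshold=0.9):
    """Mean cross-entropy between weak-view pseudo-labels and strong-view predictions.

    weak_probs, strong_probs: one probability distribution over character
    classes per sequence position, for the weakly and strongly augmented
    views of the same text image.
    Assumption (not from the abstract): positions whose weak-view confidence
    is below `threshold` contribute nothing to the loss.
    """
    if len(weak_probs) != len(strong_probs):
        raise ValueError("both views must have the same sequence length")

    total, kept = 0.0, 0
    for weak, strong in zip(weak_probs, strong_probs):
        confidence = max(weak)
        if confidence < threshold:
            continue  # pseudo-label too uncertain; skip this character
        pseudo_label = weak.index(confidence)
        # Clamp to avoid log(0) when the strong view assigns zero probability.
        total += -math.log(max(strong[pseudo_label], 1e-12))
        kept += 1
    return total / kept if kept else 0.0


# Three character positions, three classes each (toy numbers).
weak = [[0.95, 0.03, 0.02], [0.50, 0.30, 0.20], [0.05, 0.92, 0.03]]
strong = [[0.70, 0.20, 0.10], [0.10, 0.80, 0.10], [0.20, 0.60, 0.20]]
loss = char_consistency_loss(weak, strong)
```

In a multimodal recognizer, a function like this could be applied once to the vision output and once to the language output, in the spirit of the abstract's statement that consistency is enforced on each modality separately; how the two terms are weighted is left open here.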

updated: Sun May 08 2022 13:55:30 GMT+0000 (UTC)

published: Sun May 08 2022 13:55:30 GMT+0000 (UTC)

arXiv
