Sequence-to-Sequence Contrastive Learning for Text Recognition

Aviad Aberdam; Ron Litman; Shahar Tsiper; Oron Anschel; Ron Slossberg; Shai Mazor; R. Manmatha; Pietro Perona

We propose a framework for sequence-to-sequence contrastive learning (SeqCLR) of visual representations, which we apply to text recognition. To account for the sequence-to-sequence structure, each feature map is divided into different instances over which the contrastive loss is computed. This operation enables us to contrast at the sub-word level: from each image we extract several positive pairs and multiple negative examples. To yield effective visual representations for text recognition, we further suggest novel augmentation heuristics, different encoder architectures, and custom projection heads. Experiments on handwritten text and on scene text show that when a text decoder is trained on the learned representations, our method outperforms non-sequential contrastive methods. In addition, when the amount of supervision is reduced, SeqCLR significantly improves performance compared with supervised training, and when fine-tuned with 100% of the labels, our method achieves state-of-the-art results on standard handwritten text recognition benchmarks.
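The idea of dividing each feature map into instances and contrasting at the sub-word level can be illustrated with a small NumPy sketch. This is not the authors' implementation. The helper names (`split_into_instances`, `nt_xent_loss`, `seq_contrastive_loss`), the choice of average pooling over contiguous windows, the NT-Xent form of the loss, and the temperature value are all assumptions made for illustration. The abstract specifies only that feature maps are split into instances and that a contrastive loss is computed over them.

```python
import numpy as np


def split_into_instances(frames: np.ndarray, num_instances: int) -> np.ndarray:
    """Average-pool a (T, D) frame sequence into (num_instances, D) instances.

    Assumption: instances are contiguous windows of frames, mean-pooled.
    """
    chunks = np.array_split(frames, num_instances, axis=0)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])


def nt_xent_loss(z_a: np.ndarray, z_b: np.ndarray, temperature: float = 0.5) -> float:
    """NT-Xent loss; row i of z_a and row i of z_b form a positive pair.

    Every other row in either view acts as a negative example.
    """
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # an instance is never compared with itself
    n = z_a.shape[0]
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-sum-exp over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob = sim[np.arange(2 * n), positives] - log_denominator
    return float(-log_prob.mean())


def seq_contrastive_loss(views_a, views_b, num_instances=4, temperature=0.5):
    """Contrast sub-sequence instances from two augmented views.

    views_a, views_b: (B, T, D) frame features of the same B images under
    two augmentations. Each image contributes num_instances positive pairs.
    """
    inst_a = np.concatenate([split_into_instances(f, num_instances) for f in views_a])
    inst_b = np.concatenate([split_into_instances(f, num_instances) for f in views_b])
    return nt_xent_loss(inst_a, inst_b, temperature)
```

With a batch of `B` images and `num_instances` windows per image, each image yields `num_instances` positive pairs instead of one, and the remaining instances in the batch serve as negatives. The sketch pairs window `i` of one view with window `i` of the other, which is only meaningful if the augmentations roughly preserve the horizontal order of the text.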

updated: Sun Dec 20 2020 09:07:41 GMT+0000 (UTC)

published: Sun Dec 20 2020 09:07:41 GMT+0000 (UTC)

arXiv
