Text is Text, No Matter What: Unifying Text Recognition using Knowledge Distillation

Ayan Kumar Bhunia; Aneeshan Sain; Pinaki Nath Chowdhury; Yi-Zhe Song

Text recognition remains a fundamental and extensively researched topic in computer vision, largely owing to its wide array of commercial applications. The challenging nature of the problem has, however, dictated a fragmentation of research efforts into Scene Text Recognition (STR), which deals with text in everyday scenes, and Handwriting Text Recognition (HTR), which tackles handwritten text. In this paper, for the first time, we argue for their unification: we aim for a single model that can compete favourably with two separate state-of-the-art STR and HTR models. We first show that cross-utilisation of STR and HTR models triggers significant performance drops due to differences in their inherent challenges. We then tackle their union by introducing a knowledge distillation (KD) based framework. This is, however, non-trivial, largely because the variable-length and sequential nature of text sequences renders off-the-shelf KD techniques, which mostly work with global fixed-length data, inadequate. To that end, we propose three distillation losses, all specifically designed to cope with these unique characteristics of text recognition. Empirical evidence suggests that our proposed unified model performs on par with the individual models, even surpassing them in certain cases. Ablative studies demonstrate that naive baselines such as a two-stage framework, as well as domain adaptation/generalisation alternatives, do not work as well, further verifying the appropriateness of our design.
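The variable-length issue raised in the abstract can be illustrated with a generic sketch. The code below is not one of the paper's three distillation losses, which the abstract does not specify; it is a standard temperature-scaled logit distillation applied per time step, with padded positions excluded from the average. The function names, the nested-list layout (batch, time step, character class), and the `lengths` argument are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one time step's class logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal size."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def masked_sequence_kd_loss(teacher_logits, student_logits, lengths,
                            temperature=2.0):
    """Average per-step KL between teacher and student over valid steps only.

    teacher_logits / student_logits: padded nested lists indexed as
    [sequence][time step][class] (hypothetical layout).
    lengths: true length of each sequence; steps beyond it are padding
    and are ignored so they cannot dilute or distort the loss.
    """
    total, count = 0.0, 0
    for t_seq, s_seq, n in zip(teacher_logits, student_logits, lengths):
        for t_step, s_step in zip(t_seq[:n], s_seq[:n]):
            p = softmax(t_step, temperature)
            q = softmax(s_step, temperature)
            total += kl_divergence(p, q)
            count += 1
    # The T^2 factor is the usual rescaling in temperature-based distillation.
    return (temperature ** 2) * total / max(count, 1)
```

Averaging over valid steps rather than over the padded maximum length keeps short words from being under-weighted relative to long ones, which is one simple way a fixed-length KD recipe might be adapted to sequences.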

updated: Mon Jul 26 2021 10:10:34 GMT+0000 (UTC)

published: Mon Jul 26 2021 10:10:34 GMT+0000 (UTC)

arXiv
