Looking and Listening: Audio Guided Text Recognition

Wenwen Yu; Mingyu Liu; Biao Yang; Enming Zhang; Deqiang Jiang; Xing Sun; Yuliang Liu; Xiang Bai

Text recognition in the wild is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest that combining vision and language processing is effective for scene text recognition. Yet resolving edit errors such as insertions, deletions, and substitutions remains the main challenge for existing approaches. The content of a text and its audio naturally correspond to each other, i.e., a single character error may result in a clearly different pronunciation. In this paper, we propose AudioOCR, a simple yet effective probabilistic audio decoder for mel spectrogram sequence prediction that guides scene text recognition. It participates only in the training phase and adds no extra cost at inference. The underlying principle of AudioOCR can be easily applied to existing approaches. Experiments using 7 previous scene text recognition methods on 12 existing regular, irregular, and occluded benchmarks demonstrate that the proposed method brings consistent improvement. More importantly, our experiments show that AudioOCR generalizes to more challenging scenarios, including recognizing non-English text, out-of-vocabulary words, and text with various accents. Code will be available at https://github.com/wenwenyu/AudioOCR.
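The edit errors named in the abstract (add, delete, replace) are the three operations counted by the Levenshtein edit distance. The following is a minimal illustrative sketch of that standard measure, not code from the paper or its repository; the function name `edit_distance` and the example strings are assumptions chosen for illustration.

```python
def edit_distance(pred: str, truth: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn `pred` into `truth` (Levenshtein distance).

    Illustrative only: `pred` stands in for a recognizer's output and
    `truth` for the ground-truth transcription.
    """
    # prev[j] holds the distance between the first i-1 chars of pred
    # and the first j chars of truth.
    prev = list(range(len(truth) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]  # turning i chars of pred into "" takes i deletions
        for j, t in enumerate(truth, start=1):
            cost = 0 if p == t else 1
            curr.append(min(
                prev[j] + 1,         # delete p from pred
                curr[j - 1] + 1,     # insert t into pred
                prev[j - 1] + cost,  # substitute p with t (free on a match)
            ))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("house", "horse")` returns 1: a single substituted character, which is the kind of error the abstract argues would produce a clearly different pronunciation.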

updated: Tue Jun 06 2023 08:08:18 GMT+0000 (UTC)

published: Tue Jun 06 2023 08:08:18 GMT+0000 (UTC)

arXiv
