MANGO: A Mask Attention Guided One-Stage Scene Text Spotter

Liang Qiao; Ying Chen; Zhanzhan Cheng; Yunlu Xu; Yi Niu; Shiliang Pu; Fei Wu

Recently, end-to-end scene text spotting has become a popular research topic due to its advantages of global optimization and high maintainability in real applications. Most methods attempt to develop various region of interest (RoI) operations to concatenate the detection part and the sequence recognition part into a two-stage text spotting framework. However, in such a framework, the recognition part is highly sensitive to the detected results (e.g., the compactness of text contours). To address this problem, we propose a novel Mask AttentioN Guided One-stage text spotting framework named MANGO, in which character sequences can be directly recognized without RoI operations. Concretely, a position-aware mask attention module is developed to generate attention weights on each text instance and its characters. It allows different text instances in an image to be allocated to different feature map channels, which are further grouped as a batch of instance features. Finally, a lightweight sequence decoder is applied to generate the character sequences. It is worth noting that MANGO inherently adapts to arbitrary-shaped text spotting and can be trained end-to-end with only coarse position information (e.g., rectangular bounding boxes) and text annotations. Experimental results show that the proposed method achieves competitive and even new state-of-the-art performance on both regular and irregular text spotting benchmarks, i.e., ICDAR 2013, ICDAR 2015, Total-Text, and SCUT-CTW1500.
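The idea of allocating text instances to separate attention channels and grouping the attended features into a batch can be illustrated with a small sketch. This is not the authors' implementation: the grid-based channel assignment, the spatial softmax, and all function names (grid_channel, softmax2d, attend, instance_batch) are assumptions made purely for illustration, and plain Python lists stand in for tensors.

```python
import math

def grid_channel(cx, cy, width, height, s):
    """Map a text center (cx, cy) to one of s*s attention channels.
    Hypothetical assignment rule: the channel is the index of the grid cell
    that contains the center of the text instance."""
    col = min(int(cx * s / width), s - 1)
    row = min(int(cy * s / height), s - 1)
    return row * s + col

def softmax2d(scores):
    """Spatial softmax over an H x W score map, so the weights sum to 1."""
    peak = max(max(row) for row in scores)
    exps = [[math.exp(v - peak) for v in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]

def attend(features, attention):
    """Pool a C x H x W feature map with one H x W attention map
    into a single C-dimensional vector (attention-weighted sum)."""
    return [
        sum(a * f
            for arow, frow in zip(attention, fmap)
            for a, f in zip(arow, frow))
        for fmap in features
    ]

def instance_batch(features, score_maps, occupied):
    """Build one attended feature vector per occupied attention channel.
    The result plays the role of a batch of instance features that a
    sequence decoder could consume; no RoI cropping is involved."""
    return [attend(features, softmax2d(score_maps[k])) for k in occupied]

# Toy example: 2 feature channels on a 2 x 2 map, 4 attention channels (s = 2).
features = [[[1.0, 0.0], [0.0, 0.0]],
            [[0.0, 0.0], [0.0, 1.0]]]
score_maps = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(4)]
score_maps[0][0][0] = 50.0  # channel 0 attends to the top-left pixel
score_maps[3][1][1] = 50.0  # channel 3 attends to the bottom-right pixel
occupied = [grid_channel(10, 10, 100, 100, 2), grid_channel(90, 90, 100, 100, 2)]
batch = instance_batch(features, score_maps, occupied)
```

In this toy setup each occupied channel yields one feature vector, so two text instances in the same image end up as two rows of a batch. A character-level variant would, under the same assumptions, add one attention map per character position within each instance.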

updated: Mon Oct 25 2021 09:32:55 GMT+0000 (UTC)

published: Tue Dec 08 2020 10:47:49 GMT+0000 (UTC)

arXiv
