DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting

Maoyuan Ye; Jing Zhang; Shanshan Zhao; Juhua Liu; Tongliang Liu; Bo Du; Dacheng Tao

End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Handling the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although Transformer-based methods eliminate heuristic post-processing, they still suffer from poor synergy between the sub-tasks and from low training efficiency. In this paper, we present DeepSolo, a simple DETR-like baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing through a single decoder, the point queries encode the requisite text semantics and locations, and can therefore be decoded in parallel into the center line, boundary, script, and confidence of the text via very simple prediction heads. We also introduce a text-matching criterion that delivers more accurate supervisory signals and thus enables more efficient training. Quantitative experiments on public benchmarks demonstrate that DeepSolo outperforms previous state-of-the-art methods and achieves better training efficiency. In addition, DeepSolo is compatible with line annotations, which cost much less to produce than polygon annotations. The code is available at https://github.com/ViTAE-Transformer/DeepSolo.
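To make the idea of representing a text instance as ordered points more concrete, the following is a minimal sketch in plain Python. It assumes the center line is given as a polyline and that the points are spaced equally by arc length; both are illustrative assumptions, and the function name `sample_ordered_points` is hypothetical rather than taken from the DeepSolo code base.

```python
import math


def sample_ordered_points(polyline, n_points):
    """Return n_points spaced equally by arc length along a polyline.

    Illustrative assumption: the text center line is a list of (x, y)
    vertices in reading order. This is not the authors' implementation.
    """
    if n_points < 2 or len(polyline) < 2:
        raise ValueError("need at least 2 points and 2 polyline vertices")

    # Cumulative arc length at every vertex of the polyline.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(polyline, polyline[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]

    points = []
    seg = 0  # index of the segment that contains the current target
    for i in range(n_points):
        target = total * i / (n_points - 1)
        # Advance to the segment whose end lies at or beyond the target.
        while seg < len(polyline) - 2 and cum[seg + 1] < target:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = 0.0 if seg_len == 0 else (target - cum[seg]) / seg_len
        (x0, y0), (x1, y1) = polyline[seg], polyline[seg + 1]
        points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points


# Points keep the reading order of the polyline, so the i-th point can be
# associated with the i-th position in the character sequence.
straight = sample_ordered_points([(0.0, 0.0), (10.0, 0.0)], 5)
bent = sample_ordered_points([(0.0, 0.0), (4.0, 0.0), (4.0, 4.0)], 3)
```

In a DETR-like model, coordinates such as these could serve as initial positions for the point queries; how DeepSolo actually generates and refines them is described in the paper and the linked repository.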

updated: Wed Mar 15 2023 14:03:07 GMT+0000 (UTC)

published: Sat Nov 19 2022 19:06:22 GMT+0000 (UTC)

arXiv
