DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting

Maoyuan Ye; Jing Zhang; Shanshan Zhao; Juhua Liu; Tongliang Liu; Bo Du; Dacheng Tao

End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Handling the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although transformer-based methods eliminate heuristic post-processing, they still suffer from poor synergy between the sub-tasks and low training efficiency. In this paper, we present DeepSolo, a simple detection transformer baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing through a single decoder, the point queries encode the requisite text semantics and locations, and can thus be further decoded into the center line, boundary, script, and confidence of the text via very simple prediction heads in parallel, solving the sub-tasks of text spotting in a unified framework. We also introduce a text-matching criterion to deliver more accurate supervisory signals, enabling more efficient training. Quantitative experiments on public benchmarks demonstrate that DeepSolo outperforms previous state-of-the-art methods and achieves better training efficiency. In addition, DeepSolo is compatible with line annotations, which cost much less to produce than polygons. The code will be released.
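To make the "ordered points" representation concrete, the following is a minimal sketch of how a text instance's center line could be turned into a fixed number of ordered points. It is not the authors' implementation: the abstract does not say how the points are placed, so uniform arc-length sampling along a polyline is an assumption here, and the function name `sample_ordered_points` is hypothetical.

```python
from math import hypot


def sample_ordered_points(polyline, n):
    """Return n points evenly spaced by arc length along a polyline.

    Assumption (not from the abstract): the center line is given as a
    list of (x, y) vertices in reading order, and points are placed
    uniformly by arc length.
    """
    if n < 2 or len(polyline) < 2:
        raise ValueError("need at least 2 samples and 2 polyline vertices")

    # Cumulative arc length at each vertex.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(polyline, polyline[1:]):
        cum.append(cum[-1] + hypot(x1 - x0, y1 - y0))
    total = cum[-1]

    points = []
    seg = 0
    for i in range(n):
        target = total * i / (n - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(polyline) - 2 and cum[seg + 1] < target:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = 0.0 if seg_len == 0 else (target - cum[seg]) / seg_len
        (x0, y0), (x1, y1) = polyline[seg], polyline[seg + 1]
        # Linear interpolation within the segment.
        points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points


# Example: an L-shaped center line sampled at 5 ordered points.
center_line = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0)]
ordered = sample_ordered_points(center_line, 5)
```

Because the points keep the reading order of the center line, the i-th point can be paired with the i-th position of the character sequence, which is the property the abstract relies on when it says one set of point queries serves both detection and recognition.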

updated: Sat Nov 19 2022 19:06:22 GMT+0000 (UTC)

published: Sat Nov 19 2022 19:06:22 GMT+0000 (UTC)

arXiv
