DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer

Maoyuan Ye; Jing Zhang; Shanshan Zhao; Juhua Liu; Bo Du; Dacheng Tao

Recently, Transformer-based methods that predict polygon points or Bezier curve control points to localize text have become quite popular in scene text detection. However, the point label form they use implies the human reading order, which affects the robustness of the Transformer model. As for the model architecture, the formulation of the queries used in the decoder has not been fully explored by previous methods. In this paper, we propose a concise dynamic point scene text detection Transformer network termed DPText-DETR, which directly uses point coordinates as queries and dynamically updates them between decoder layers. We point out a simple yet effective positional point label form to tackle the side effect of the original one. Moreover, an Enhanced Factorized Self-Attention module is designed to explicitly model the circular shape of polygon point sequences beyond non-local attention. Extensive experiments demonstrate the training efficiency, robustness, and state-of-the-art performance of the method on various arbitrary-shape scene text benchmarks. Beyond detection, we observe that existing end-to-end spotters struggle to recognize inverse-like texts. To evaluate their performance objectively and facilitate future research, we propose an Inverse-Text test set containing 500 manually labeled images. The code and Inverse-Text test set will be available at https://github.com/ymy-k/DPText-DETR.
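
The idea of using point coordinates as queries and updating them between decoder layers can be illustrated with a minimal sketch. The abstract does not give the update rule, so the code below assumes a common DETR-style refinement: each layer predicts an offset in logit space that is added to the inverse-sigmoid of the current normalized coordinates. The names `refine_points` and `run_decoder`, and the use of plain callables in place of real decoder layers, are hypothetical and are not taken from the DPText-DETR code.

```python
import math


def inverse_sigmoid(p, eps=1e-5):
    """Map a normalized coordinate in (0, 1) to logit space (clamped for stability)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def refine_points(points, offsets):
    """Apply one update step to a list of (x, y) points normalized to [0, 1].

    ASSUMPTION: offsets are (dx, dy) pairs in logit space, as in common
    DETR-style reference-point refinement; the abstract does not specify this.
    """
    return [
        (sigmoid(inverse_sigmoid(x) + dx), sigmoid(inverse_sigmoid(y) + dy))
        for (x, y), (dx, dy) in zip(points, offsets)
    ]


def run_decoder(points, layers):
    """Pass point queries through a stack of layers, updating them after each one.

    `layers` is a list of callables mapping the current points to offsets.
    They are hypothetical stand-ins for real decoder layers.
    """
    for predict_offsets in layers:
        points = refine_points(points, predict_offsets(points))
    return points


if __name__ == "__main__":
    # Four corner points of a rough text polygon, in normalized image coordinates.
    polygon = [(0.2, 0.3), (0.8, 0.3), (0.8, 0.5), (0.2, 0.5)]
    # Toy "layer": nudges every point slightly to the right.
    nudge_right = lambda pts: [(0.1, 0.0)] * len(pts)
    print(run_decoder(polygon, [nudge_right, nudge_right]))
```

Working in logit space keeps every updated coordinate inside the image bounds, which is one reason this style of refinement is popular; whether DPText-DETR uses exactly this rule should be checked against the released code.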

updated: Sun Jul 10 2022 15:45:16 GMT+0000 (UTC)

published: Sun Jul 10 2022 15:45:16 GMT+0000 (UTC)

arXiv
