DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer

Maoyuan Ye; Jing Zhang; Shanshan Zhao; Juhua Liu; Bo Du; Dacheng Tao

Recently, Transformer-based methods that predict polygon points or Bezier curve control points to localize text have become popular in scene text detection. However, these methods, built upon the detection Transformer framework, might achieve sub-optimal training efficiency and performance due to coarse positional query modeling. In addition, the point label form exploited in previous works implies the human reading order, which, from our observation, impedes detection robustness. To address these challenges, this paper proposes a concise Dynamic Point Text DEtection TRansformer network, termed DPText-DETR. In detail, DPText-DETR directly leverages explicit point coordinates to generate position queries and dynamically updates them in a progressive way. Moreover, to improve the spatial inductive bias of non-local self-attention in Transformer, we present an Enhanced Factorized Self-Attention module, which provides the point queries within each instance with circular shape guidance. Furthermore, we design a simple yet effective positional label form to tackle the side effect of the previous form. To further evaluate the impact of different label forms on detection robustness in real-world scenarios, we establish an Inverse-Text test set containing 500 manually labeled images. Extensive experiments demonstrate the high training efficiency, robustness, and state-of-the-art performance of our method on popular benchmarks. The code and the Inverse-Text test set are available at https://github.com/ymy-k/DPText-DETR.
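
The idea of generating position queries from explicit point coordinates and updating them progressively can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see their repository for that). The function names, the sinusoidal encoding, the logit-space update, and the per-layer offsets passed in by hand are all assumptions made for illustration; in a real detector the offsets would be predicted by each decoder layer.

```python
import numpy as np


def inverse_sigmoid(x, eps=1e-5):
    """Map normalized coordinates in (0, 1) to logit space."""
    x = np.clip(x, eps, 1.0 - eps)
    return np.log(x / (1.0 - x))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sine_position_query(points, dim=8, temperature=10000.0):
    """Encode normalized (x, y) points as sinusoidal features.

    points: array of shape (num_points, 2) with values in [0, 1].
    Returns an array of shape (num_points, 2 * dim); dim must be even.
    This encoding is an assumed stand-in, not the paper's exact formulation.
    """
    freqs = temperature ** (2 * (np.arange(dim) // 2) / dim)   # (dim,)
    scaled = points[..., None] * 2.0 * np.pi / freqs           # (N, 2, dim)
    enc = np.empty_like(scaled)
    enc[..., 0::2] = np.sin(scaled[..., 0::2])
    enc[..., 1::2] = np.cos(scaled[..., 1::2])
    return enc.reshape(points.shape[0], -1)


def refine_points(points, offsets_per_layer):
    """Progressively update points, rebuilding the position query each layer.

    offsets_per_layer: iterable of (num_points, 2) arrays. Here they are
    supplied by the caller (hypothetical); a decoder layer would predict them.
    Returns the final points and the list of per-layer position queries.
    """
    queries = []
    for offsets in offsets_per_layer:
        # The position query is derived from the current explicit coordinates.
        queries.append(sine_position_query(points))
        # Update in logit space so the result stays inside (0, 1).
        points = sigmoid(inverse_sigmoid(points) + offsets)
    return points, queries


# Four control points of one text instance, in normalized image coordinates.
initial_points = np.array([[0.2, 0.3], [0.8, 0.3], [0.8, 0.7], [0.2, 0.7]])
layer_offsets = [np.full((4, 2), 0.1), np.full((4, 2), -0.05)]
refined_points, position_queries = refine_points(initial_points, layer_offsets)
```

Because each layer re-encodes the current coordinates, the position query follows the points as they move, rather than staying fixed at a coarse initial estimate.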

updated: Mon Nov 28 2022 12:57:16 GMT+0000 (UTC)

published: Sun Jul 10 2022 15:45:16 GMT+0000 (UTC)

arXiv
