SRFormer: Empowering Regression-Based Text Detection Transformer with Segmentation

Qingwen Bu; Sungrae Park; Minsoo Khang; Yichuan Cheng

Existing text detection techniques fall broadly into two groups: segmentation-based methods and regression-based methods. Segmentation models are more robust to font variations but require intricate post-processing, which leads to high computational overhead. Regression-based methods make instance-aware predictions but are limited in robustness and data efficiency because they rely on high-level representations. We propose SRFormer, a unified DETR-based model that combines Segmentation and Regression, aiming to exploit the inherent robustness of segmentation representations together with the straightforward post-processing of instance-level regression. Our empirical analysis indicates that favorable segmentation predictions can be obtained at the initial decoder layers. We therefore restrict the segmentation branches to the first few decoder layers and apply progressive regression refinement in subsequent layers, achieving performance gains while minimizing the additional computational load from the mask. Furthermore, we propose a Mask-informed Query Enhancement module: we take the segmentation result as a natural soft-ROI to pool and extract robust pixel representations, which are then used to enhance and diversify instance queries. Extensive experiments across multiple benchmarks highlight our method's exceptional robustness, superior training and data efficiency, and state-of-the-art performance.
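
The soft-ROI pooling idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything here is an assumption for illustration only: the function names `soft_roi_pool` and `enhance_queries`, the tensor shapes, the sigmoid weighting, and the residual update are hypothetical and may differ from the module described in the paper.

```python
import numpy as np

def soft_roi_pool(features, mask_logits, eps=1e-6):
    """Mask-weighted average pooling (illustrative assumption, not the paper's code).

    features:    (C, H, W) pixel feature map
    mask_logits: (N, H, W) per-instance segmentation logits
    returns:     (N, C) one pooled feature vector per instance
    """
    # Sigmoid turns logits into soft masks in (0, 1); these act as soft ROIs.
    weights = 1.0 / (1.0 + np.exp(-mask_logits))
    # Weighted sum of pixel features under each instance's soft mask.
    pooled_sum = np.einsum("nhw,chw->nc", weights, features)
    # Normalize by the total mask weight so the result is a weighted mean.
    total = weights.sum(axis=(1, 2))[:, None] + eps
    return pooled_sum / total

def enhance_queries(queries, pooled):
    """Hypothetical residual update: add pooled pixel features to the queries.

    Assumes the pooled feature dimension equals the query dimension.
    """
    return queries + pooled

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8, 8))      # C=4 channels on an 8x8 map
mask_logits = rng.normal(size=(3, 8, 8))   # N=3 instance masks
queries = rng.normal(size=(3, 4))          # one query per instance

pooled = soft_roi_pool(features, mask_logits)
enhanced = enhance_queries(queries, pooled)
print(pooled.shape, enhanced.shape)
```

In this sketch each instance query receives a feature vector averaged over the pixels its mask covers, which is one simple way to read "pool and extract robust pixel representations" from a soft mask.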

updated: Mon Aug 21 2023 07:34:31 GMT+0000 (UTC)

published: Mon Aug 21 2023 07:34:31 GMT+0000 (UTC)

arXiv
