Aggregated Text Transformer for Scene Text Detection

Zhao Zhou; Xiangcheng Du; Yingbin Zheng; Cheng Jin

This paper explores a multi-scale aggregation strategy for scene text detection in natural images. We present the Aggregated Text TRansformer (ATTR), which is designed to represent text in scene images with a multi-scale self-attention mechanism. Starting from an image pyramid with multiple resolutions, features are first extracted at each scale with shared weights and then fed into a Transformer encoder-decoder architecture. The multi-scale image representations are robust and contain rich information about text content of various sizes. The text Transformer aggregates these features to learn interactions across scales and improve the text representation. The proposed method detects scene text by representing each text instance as an individual binary mask, which accommodates curved text and regions with dense instances. Extensive experiments on public scene text detection datasets demonstrate the effectiveness of the proposed framework.
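The first stage described in the abstract (an image pyramid whose levels pass through one shared-weight feature extractor before being aggregated) can be illustrated with a small sketch. This is not the authors' implementation: ATTR uses a learned backbone and a Transformer, whereas here a fixed 2x2 averaging kernel stands in for the backbone and the "aggregation" is only the flattening of all scales into one token list, which is the kind of input a Transformer encoder would consume. All function names (`downsample2x`, `build_pyramid`, `extract_features`, `multiscale_tokens`) are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel values


def downsample2x(img: Image) -> Image:
    """Halve resolution by averaging each 2x2 block (odd edges are dropped)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [
        [
            (img[2 * y][2 * x] + img[2 * y][2 * x + 1]
             + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(w)
        ]
        for y in range(h)
    ]


def build_pyramid(img: Image, levels: int) -> List[Image]:
    """Return the image at full, 1/2, 1/4, ... resolution."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid


def extract_features(img: Image, kernel: Image) -> Image:
    """Valid 2D cross-correlation; a stand-in for a shared-weight backbone."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * img[y + i][x + j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def multiscale_tokens(
    img: Image, kernel: Image, levels: int = 3
) -> List[Tuple[int, int, int, float]]:
    """Flatten per-scale feature maps into one list of (scale, y, x, value)."""
    tokens = []
    for scale, level in enumerate(build_pyramid(img, levels)):
        fmap = extract_features(level, kernel)  # same kernel at every scale
        for y, row in enumerate(fmap):
            for x, value in enumerate(row):
                tokens.append((scale, y, x, value))
    return tokens


if __name__ == "__main__":
    # 16x16 checkerboard as a toy input image (an assumption for the demo).
    image = [[float((x + y) % 2) for x in range(16)] for y in range(16)]
    shared_kernel = [[0.25, 0.25], [0.25, 0.25]]
    tokens = multiscale_tokens(image, shared_kernel, levels=3)
    print(len(tokens), "tokens from", len({t[0] for t in tokens}), "scales")
```

Keeping the scale index in each token mirrors the idea that the aggregation step must know which resolution a feature came from. In a real model that role would be played by learned scale or positional embeddings, which this sketch does not attempt.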

updated: Tue Jun 04 2024 08:54:51 GMT+0000 (UTC)

published: Fri Nov 25 2022 09:47:34 GMT+0000 (UTC)

arXiv
