Turning a CLIP Model into a Scene Text Spotter

Wenwen Yu; Yuliang Liu; Xingkui Zhu; Haoyu Cao; Xing Sun; Xiang Bai

We exploit the potential of the large-scale Contrastive Language-Image Pretraining (CLIP) model to enhance scene text detection and spotting, transforming it into a robust backbone, FastTCM-CR50. This backbone uses visual prompt learning and cross-attention in CLIP to extract image-based and text-based prior knowledge. Using predefined and learnable prompts, FastTCM-CR50 introduces an instance-language matching process that strengthens the synergy between image and text embeddings, thereby refining text regions. Our Bimodal Similarity Matching (BSM) module facilitates dynamic language prompt generation, enabling offline computation and improving performance. FastTCM-CR50 offers several advantages: 1) It can enhance existing text detectors and spotters, improving performance by an average of 1.7% and 1.5%, respectively. 2) It outperforms the previous TCM-CR50 backbone, yielding an average improvement of 0.2% and 0.56% in text detection and spotting tasks, respectively, along with a 48.5% increase in inference speed. 3) It shows robust few-shot training capabilities: using only 10% of the supervised data, FastTCM-CR50 improves performance by an average of 26.5% and 5.5% for text detection and spotting tasks, respectively. 4) It consistently enhances performance on out-of-distribution text detection and spotting datasets, particularly the NightTime-ArT subset from ICDAR2019-ArT and the DOTA dataset for oriented object detection. The code is available at https://github.com/wenwenyu/TCM.
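
The abstract does not spell out how instance-language matching is computed. As a rough illustration only, the sketch below scores each pixel embedding against a single text embedding with cosine similarity and squashes the result through a sigmoid, giving a coarse text-region map. The function names, the nested-list data layout, and the temperature value of 0.07 are assumptions made for this example and are not taken from the FastTCM-CR50 code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def matching_score_map(pixel_embeddings, text_embedding, temperature=0.07):
    """Hypothetical helper: score every pixel embedding against one text embedding.

    pixel_embeddings is an H x W grid (list of rows) of vectors.
    The temperature is an assumed value, not one reported in the abstract.
    Returns an H x W grid of values in (0, 1).
    """
    score_map = []
    for row in pixel_embeddings:
        scores = []
        for pixel in row:
            logit = cosine(pixel, text_embedding) / temperature
            scores.append(1.0 / (1.0 + math.exp(-logit)))
        score_map.append(scores)
    return score_map


# Toy 1 x 3 grid: aligned, orthogonal, and opposite to the text embedding.
text = [1.0, 0.0]
grid = [[[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]]
scores = matching_score_map(grid, text)
```

In this toy grid the aligned pixel scores close to 1, the orthogonal pixel scores exactly 0.5, and the opposite pixel scores close to 0. In a real detector such a map would act as a prior that the downstream detection head refines.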

updated: Mon Aug 21 2023 01:25:48 GMT+0000 (UTC)

published: Mon Aug 21 2023 01:25:48 GMT+0000 (UTC)

arXiv
