Transferring General Multimodal Pretrained Models to Text Recognition

Junyang Lin; Xuancheng Ren; Yichang Zhang; Gao Liu; Peng Wang; An Yang; Chang Zhou

This paper proposes a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR outperforms the baselines and achieves state-of-the-art performance on the Chinese text recognition benchmark. Additionally, we construct an OCR pipeline with OFA-OCR and demonstrate that its performance is competitive with a product-level API. The code (https://github.com/OFA-Sys/OFA) and demo (https://modelscope.cn/studios/damo/ofa_ocr_pipeline/summary) are publicly available.
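The recasting step mentioned in the abstract can be illustrated with a minimal sketch, assuming each recognition sample is a cropped text image paired with its transcription. The names `CaptioningExample`, `to_captioning_example`, and the wording of `PROMPT` are hypothetical and are not taken from the OFA codebase; the abstract does not specify the instruction text or the data format.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical instruction: the abstract does not state the prompt wording.
PROMPT = "What is the text in the image?"


@dataclass(frozen=True)
class CaptioningExample:
    """One text-recognition sample expressed in image-captioning form."""
    image_path: str  # path to a cropped text image
    prompt: str      # instruction given to the vision-language model
    target: str      # transcription, playing the role of the caption


def to_captioning_example(image_path: str, transcription: str,
                          prompt: str = PROMPT) -> CaptioningExample:
    # Surrounding whitespace is stripped so the target holds only the text.
    return CaptioningExample(image_path, prompt, transcription.strip())


def build_dataset(pairs: Iterable[Tuple[str, str]]) -> List[CaptioningExample]:
    # Convert (image_path, transcription) pairs into captioning-style samples.
    return [to_captioning_example(path, text) for path, text in pairs]
```

Under this framing, a captioning-capable pretrained model could in principle be finetuned on such samples without any OCR-specific architecture changes, which is the idea the abstract describes; the actual training setup is documented in the linked repository.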

updated: Mon Dec 19 2022 08:30:42 GMT+0000 (UTC)

published: Mon Dec 19 2022 08:30:42 GMT+0000 (UTC)

arXiv
