On the Hidden Mystery of OCR in Large Multimodal Models

Yuliang Liu; Zhang Li; Hongliang Li; Wenwen Yu; Mingxin Huang; Dezhi Peng; Mingyu Liu; Mingrui Chen; Chunyuan Li; Cheng-lin Liu; Lianwen Jin; Xiang Bai

Large models have recently played a dominant role in natural language processing and multimodal vision-language learning, but their efficacy in text-related visual tasks remains less explored. We conducted a comprehensive study of existing publicly available multimodal models, evaluating their performance in text recognition (document text, artistic text, handwritten text, scene text), text-based visual question answering (document text, scene text, and bilingual text), key information extraction (receipts, documents, and nutrition facts), and handwritten mathematical expression recognition. Our findings reveal strengths and weaknesses in these models: they rely primarily on semantic understanding for word recognition and exhibit inferior perception of individual character shapes. They are also indifferent to text length and have limited ability to detect fine-grained features in images. Consequently, these results demonstrate that even the most powerful current large multimodal models cannot match domain-specific methods in traditional text tasks and face greater challenges in more complex tasks. Most importantly, the baseline results presented in this study could provide a foundational framework for conceiving and assessing innovative strategies aimed at enhancing zero-shot multimodal techniques. The evaluation pipeline is available at https://github.com/Yuliang-Liu/MultimodalOCR.

updated: Thu Jun 08 2023 15:14:16 GMT+0000 (UTC)

published: Sat May 13 2023 11:28:37 GMT+0000 (UTC)

arXiv
