Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution

Jianfeng Kuang; Wei Hua; Dingkang Liang; Mingkun Yang; Deqiang Jiang; Bo Ren; Xiang Bai

Visual information extraction (VIE), which aims to perform OCR and information extraction simultaneously in a unified framework, has drawn increasing attention because of its essential role in applications such as understanding receipts, goods, and traffic signs. However, existing benchmark datasets for VIE consist mainly of document images that lack adequate diversity in layout structure, background disturbance, and entity categories, so they cannot fully reveal the challenges of real-world applications. In this paper, we propose a large-scale dataset of camera images for VIE, which offers not only larger variance in layout, backgrounds, and fonts but also many more types of entities. We also propose a novel framework for end-to-end VIE that combines the OCR and information extraction stages in an end-to-end learning fashion. Unlike previous end-to-end approaches, which directly adopt OCR features as the input of an information extraction module, we propose to use contrastive learning to narrow the semantic gap caused by the difference between the OCR and information extraction tasks. We evaluate existing end-to-end VIE methods on the proposed dataset and observe that their performance drops noticeably from SROIE (a widely used English dataset) to our dataset because of its larger variance in layout and entities. These results demonstrate that our dataset is more practical for promoting advanced VIE algorithms. In addition, experiments demonstrate that the proposed VIE method consistently achieves clear performance gains on both the proposed dataset and SROIE.
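
The abstract does not say which contrastive objective is used, so the following is only a generic illustration of the idea, not the authors' method. It sketches an InfoNCE-style loss, assuming (hypothetically) that each text instance has one OCR-side feature vector and one information-extraction-side feature vector, and that matching indices form the positive pairs. The names `info_nce`, `ocr_feats`, `ie_feats`, and `temperature` are made up for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(ocr_feats, ie_feats, temperature=0.1):
    """Average InfoNCE loss over a batch of paired features.

    Assumption: ocr_feats[i] and ie_feats[i] describe the same text
    instance (positive pair); every other ie_feats[j] is a negative.
    """
    n = len(ocr_feats)
    total = 0.0
    for i in range(n):
        # Similarity of OCR feature i to every IE feature, scaled by temperature.
        logits = [cosine(ocr_feats[i], ie_feats[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the positive pair.
        total += log_denom - logits[i]
    return total / n


# Toy batch: two instances with 2-D features.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
misaligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < misaligned)
```

In this toy batch the loss is small when the two feature sets agree and large when they are swapped, which is the sense in which minimizing such a loss pulls the two representations of the same instance together.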

updated: Thu Jun 15 2023 03:31:12 GMT+0000 (UTC)

published: Fri May 12 2023 14:11:47 GMT+0000 (UTC)

arXiv
