Modeling Entities as Semantic Points for Visual Information Extraction in the Wild

Zhibo Yang; Rujiao Long; Pengfei Wang; Sibo Song; Humen Zhong; Wenqing Cheng; Xiang Bai; Cong Yao

Recently, Visual Information Extraction (VIE) has become increasingly important in both academia and industry, owing to its wide range of real-world applications. Numerous works have been proposed to tackle this problem. However, the benchmarks used to assess these methods are relatively plain, i.e., scenarios with real-world complexity are not fully represented in them. As the first contribution of this work, we curate and release a new dataset for VIE, in which the document images are much more challenging: they are taken from real applications, and difficulties such as blur, partial occlusion, and printing shift are quite common. All these factors may lead to failures in information extraction. Therefore, as the second contribution, we explore an alternative approach to precisely and robustly extract key information from document images under such tough conditions. Specifically, in contrast to previous methods, which usually either incorporate visual information into a multi-modal architecture or train text spotting and information extraction in an end-to-end fashion, we explicitly model entities as semantic points, i.e., the center points of entities are enriched with semantic information describing the attributes and relationships of different entities, which could largely benefit entity labeling and linking. Extensive experiments on standard benchmarks in this field as well as the proposed dataset demonstrate that the proposed method achieves significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models. The dataset is available at https://www.modelscope.cn/datasets/damo/SIBR/summary.
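To make the idea of a "semantic point" concrete, the following is a minimal illustrative sketch and not the authors' implementation. It only shows one plausible way to represent an entity as a center point carrying a label and links to other entities. The names `SemanticPoint`, `polygon_center`, and `to_semantic_point`, the labels "question" and "answer", and the use of the vertex mean as the center are all assumptions made for illustration; the abstract does not specify any of them.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class SemanticPoint:
    """Hypothetical container: an entity's center plus its semantic attributes."""
    center: Point
    label: str   # entity category (assumed label set, e.g. "question" / "answer")
    text: str    # transcription of the entity
    links: List[int] = field(default_factory=list)  # indices of linked entities


def polygon_center(polygon: Sequence[Point]) -> Point:
    """Mean of the polygon's vertices, used here as a simple stand-in for the entity center."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def to_semantic_point(polygon: Sequence[Point], label: str, text: str,
                      links: Sequence[int] = ()) -> SemanticPoint:
    """Collapse an entity's bounding polygon to its center and attach label and links."""
    return SemanticPoint(polygon_center(polygon), label, text, list(links))


# Two toy entities: a key field and the value linked to it (entity index 0).
key = to_semantic_point([(0, 0), (4, 0), (4, 2), (0, 2)], "question", "Name")
value = to_semantic_point([(6, 0), (10, 0), (10, 2), (6, 2)], "answer", "Alice", links=[0])
```

In this toy representation, entity labeling corresponds to predicting `label` for each point, and entity linking corresponds to predicting `links` between points.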

updated: Thu Mar 23 2023 08:21:16 GMT+0000 (UTC)

published: Thu Mar 23 2023 08:21:16 GMT+0000 (UTC)

arXiv
