M^6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis

Hiuyi Cheng; Peirong Zhang; Sihang Wu; Jiaxin Zhang; Qiyuan Zhu; Zecheng Xie; Jing Li; Kai Ding; Lianwen Jin

Document layout analysis is a crucial prerequisite for document understanding tasks such as document retrieval and conversion. Most public datasets currently contain only PDF documents and lack realistic documents, so models trained on them may not generalize well to real-world scenarios. This paper therefore introduces a large and diverse document layout analysis dataset called M^6Doc. The M^6 designation represents six properties: (1) Multi-Format (including scanned, photographed, and PDF documents); (2) Multi-Type (such as scientific articles, textbooks, books, test papers, magazines, newspapers, and notes); (3) Multi-Layout (rectangular, Manhattan, non-Manhattan, and multi-column Manhattan); (4) Multi-Language (Chinese and English); (5) Multi-Annotation Category (74 types of annotation labels with 237,116 annotation instances in 9,080 manually annotated pages); and (6) Modern documents. Additionally, we propose a transformer-based document layout analysis method called TransDLANet. It leverages an adaptive element matching mechanism that enables query embeddings to better match the ground truth and thereby improve recall, and it constructs a segmentation branch for more precise document image instance segmentation. We conduct a comprehensive evaluation of M^6Doc with various layout analysis methods and demonstrate its effectiveness. TransDLANet achieves state-of-the-art performance on M^6Doc with 64.5% mAP. The M^6Doc dataset will be available at https://github.com/HCIILAB/M6Doc.
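
The abstract reports mAP without stating the exact evaluation protocol. As background only, and assuming the common detection convention in which a predicted region counts as correct when its overlap with a ground-truth region exceeds a threshold, the underlying intersection-over-union (IoU) measure can be sketched as follows. The function name `box_iou` and the `(x_min, y_min, x_max, y_max)` box format are illustrative assumptions, not taken from the paper or its released code.

```python
from typing import Tuple

# Hypothetical box format for illustration: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Return the intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])

    # Clamp to zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate zero-area inputs.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0.0, 0.0, 2.0, 2.0)
    ground_truth = (1.0, 1.0, 3.0, 3.0)
    print(f"IoU = {box_iou(predicted, ground_truth):.4f}")
```

Under that assumed convention, average precision is computed per annotation label from predictions ranked by confidence, and mAP averages the result over the labels (74 in M^6Doc).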

updated: Sun May 21 2023 14:22:39 GMT+0000 (UTC)

published: Mon May 15 2023 15:29:06 GMT+0000 (UTC)

arXiv
