StrucTexT: Structured Text Understanding with Multi-Modal Transformers

Yulin Li; Yuxi Qian; Yuchen Yu; Xiameng Qin; Chengquan Zhang; Yan Liu; Kun Yao; Junyu Han; Jingtuo Liu; Errui Ding

Structured text understanding on Visually Rich Documents (VRDs) is a crucial part of Document Intelligence. Because the content and layout of VRDs are complex, it remains a challenging task. Most existing studies decouple the problem into two sub-tasks, entity labeling and entity linking, both of which require a thorough understanding of document context at the token and segment levels. However, little work has addressed how to extract structured data efficiently at these different levels. This paper proposes a unified framework named StrucTexT, which is flexible and effective for handling both sub-tasks. Specifically, building on the transformer, we introduce a segment-token aligned encoder to handle entity labeling and entity linking at different levels of granularity. Moreover, we design a novel pre-training strategy with three self-supervised tasks to learn a richer representation: StrucTexT uses the existing Masked Visual Language Modeling task together with the new Sentence Length Prediction and Paired Boxes Direction tasks to incorporate multi-modal information across text, image, and layout. We evaluate our method for structured text understanding at the segment level and the token level and show that it significantly outperforms state-of-the-art counterparts on the FUNSD, SROIE, and EPHOIE datasets.
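
The abstract names the Sentence Length Prediction and Paired Boxes Direction pre-training tasks without saying how their targets are built. The sketch below is one plausible way to derive such self-supervised labels from OCR segments. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `(x0, y0, x1, y1)` box format, the 8 direction bins, and the length cap `max_len` are all hypothetical choices that do not come from the abstract.

```python
import math


def box_center(box):
    """Return the center (cx, cy) of a box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def paired_box_direction(box_a, box_b, num_bins=8):
    """Hypothetical Paired Boxes Direction label.

    Quantizes the angle of the vector from the center of box_a to the
    center of box_b into num_bins equal sectors. Angles are measured in
    image coordinates, where y grows downward. The bin count is an
    assumption made for this sketch.
    """
    ax, ay = box_center(box_a)
    bx, by = box_center(box_b)
    angle = math.degrees(math.atan2(by - ay, bx - ax)) % 360.0
    # The trailing modulo guards against a float result of exactly 360.0.
    return int(angle // (360.0 / num_bins)) % num_bins


def sentence_length_label(tokens, max_len=64):
    """Hypothetical Sentence Length Prediction label.

    Uses the token count of a segment, clipped to max_len so the target
    fits a fixed number of classes. The cap is an assumption.
    """
    return min(len(tokens), max_len)


if __name__ == "__main__":
    key_box = (0, 0, 10, 10)
    value_box = (20, 0, 30, 10)  # directly to the right of key_box
    print(paired_box_direction(key_box, value_box))
    print(sentence_length_label(["Total", ":", "12.50"]))
```

Labels of this kind need no manual annotation, since they follow directly from the detected boxes and text of each segment, which is what makes such tasks self-supervised.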

updated: Tue Aug 10 2021 03:44:20 GMT+0000 (UTC)

published: Fri Aug 06 2021 02:57:07 GMT+0000 (UTC)

arXiv
