Multimodal Pre-training Based on Graph Attention Network for Document Understanding

Zhenrong Zhang; Jiefeng Ma; Jun Du; Licheng Wang; Jianshu Zhang

Document intelligence, a relatively new research topic, supports many business applications. Its main task is to automatically read, understand, and analyze documents. However, the diversity of document formats (invoices, reports, forms, etc.) and layouts makes it difficult for machines to understand documents. In this paper, we present GraphDoc, a multimodal graph attention-based model for various document understanding tasks. GraphDoc is pre-trained in a multimodal framework that uses text, layout, and image information simultaneously. Because a text block in a document relies heavily on its surrounding context, we inject the graph structure into the attention mechanism to form a graph attention layer, so that each input node can attend only to its neighbors. The input nodes of each graph attention layer are composed of textual, visual, and positional features from semantically meaningful regions in a document image. A gate fusion layer fuses the multimodal features of each node, and the graph attention layer models the contextualization between nodes. GraphDoc learns a generic representation from only 320k unlabeled documents via the Masked Sentence Modeling task. Extensive experiments on publicly available datasets show that GraphDoc achieves state-of-the-art performance, demonstrating the effectiveness of the proposed method. The code is available at https://github.com/ZZR8066/GraphDoc.
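
The two mechanisms named in the abstract, gated fusion of per-node features and attention restricted to a node's neighbors, can be illustrated with a small standard-library Python sketch. This is not the GraphDoc implementation (see the repository linked above for that). The function names `gate_fuse` and `neighbor_attention`, the fixed gate logits, the omission of learned query/key/value projections and of positional features, and the toy neighbor sets are all assumptions made for illustration.

```python
import math


def gate_fuse(text_vec, visual_vec, gate_logits):
    """Blend two feature vectors with an element-wise sigmoid gate.

    A generic gated-fusion form (assumption): fused = g * text + (1 - g) * visual.
    In a trained model the gate logits would be predicted from the inputs;
    here they are passed in directly.
    """
    fused = []
    for t, v, z in zip(text_vec, visual_vec, gate_logits):
        g = 1.0 / (1.0 + math.exp(-z))  # gate value in (0, 1)
        fused.append(g * t + (1.0 - g) * v)
    return fused


def neighbor_attention(nodes, neighbors):
    """Scaled dot-product attention where node i attends only to neighbors[i].

    nodes: list of equal-length feature vectors, used as query, key and value
           (learned projections are omitted in this sketch).
    neighbors: one non-empty collection of node indices per node.
    Returns (outputs, weights); weights[i][j] is 0.0 for every non-neighbor j.
    """
    dim = len(nodes[0])
    outputs, weights = [], []
    for i, query in enumerate(nodes):
        idx = sorted(neighbors[i])
        scores = [
            sum(q * k for q, k in zip(query, nodes[j])) / math.sqrt(dim)
            for j in idx
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [0.0] * len(nodes)
        for j, e in zip(idx, exps):
            row[j] = e / total
        weights.append(row)
        outputs.append(
            [sum(row[j] * nodes[j][d] for j in idx) for d in range(dim)]
        )
    return outputs, weights


# Toy document with three text blocks (all values are made up).
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
nodes = [gate_fuse(t, v, [0.0, 0.0]) for t, v in zip(text, visual)]

# Hypothetical neighborhoods: block 0 and block 2 are not adjacent.
neighbors = [{0, 1}, {0, 1, 2}, {1, 2}]
outputs, weights = neighbor_attention(nodes, neighbors)
```

In this toy setup, block 0 places no attention weight on block 2 because block 2 is outside its neighbor set, while each row of `weights` still sums to 1 over the allowed neighbors.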

updated: Sun Oct 23 2022 16:12:10 GMT+0000 (UTC)

published: Fri Mar 25 2022 09:27:50 GMT+0000 (UTC)

arXiv
