XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding

Zhangxuan Gu; Changhua Meng; Ke Wang; Jun Lan; Weiqiang Wang; Ming Gu; Liqing Zhang

Recently, various multimodal networks for Visually-Rich Document Understanding (VRDU) have been proposed, showing that transformers improve when visual and layout information is integrated with the text embeddings. However, most existing approaches use position embeddings to incorporate the sequence information and neglect the fact that the reading order obtained by OCR tools is noisy and often improper. In this paper, we propose a robust layout-aware multimodal network named XYLayoutLM that captures and leverages rich layout information from proper reading orders produced by our Augmented XY Cut. Moreover, we propose a Dilated Conditional Position Encoding module that handles input sequences of variable length and additionally extracts local layout information from both the textual and visual modalities while generating position embeddings. Experimental results show that XYLayoutLM achieves competitive results on document understanding tasks.
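To make the reading-order idea concrete, here is a minimal sketch of a plain recursive XY cut over OCR bounding boxes. It is not the paper's Augmented XY Cut, whose augmentation details are not given in the abstract. The names `Box`, `split_by_gaps`, and `xy_cut`, and the `(x0, y0, x1, y1)` box convention, are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

# Hypothetical box convention: (x0, y0, x1, y1), origin at the top-left.
Box = Tuple[int, int, int, int]


def split_by_gaps(boxes: List[Box], indices: List[int], axis: int) -> List[List[int]]:
    """Group boxes into runs separated by empty gaps along one axis.

    axis=0 projects onto x (vertical cuts), axis=1 onto y (horizontal cuts).
    """
    lo, hi = axis, axis + 2
    ordered = sorted(indices, key=lambda i: boxes[i][lo])
    groups = [[ordered[0]]]
    reach = boxes[ordered[0]][hi]  # farthest extent covered so far
    for i in ordered[1:]:
        if boxes[i][lo] >= reach:
            groups.append([i])  # empty gap found: start a new group
        else:
            groups[-1].append(i)
        reach = max(reach, boxes[i][hi])
    return groups


def xy_cut(boxes: List[Box], indices: Optional[List[int]] = None,
           axis: int = 1, failed: bool = False) -> List[int]:
    """Return box indices in reading order using alternating recursive cuts."""
    if indices is None:
        indices = list(range(len(boxes)))
    if len(indices) <= 1:
        return list(indices)
    groups = split_by_gaps(boxes, indices, axis)
    if len(groups) == 1:
        if failed:
            # No cut possible on either axis: fall back to top-left ordering.
            return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))
        return xy_cut(boxes, indices, 1 - axis, failed=True)
    order: List[int] = []
    for group in groups:
        order.extend(xy_cut(boxes, group, 1 - axis))
    return order


# A title above two staggered columns; a naive top-to-bottom scan
# would interleave the columns as 0, 1, 2, 3, 4.
boxes = [
    (0, 0, 100, 10),    # 0: title
    (0, 20, 45, 30),    # 1: left column, line 1
    (55, 25, 100, 35),  # 2: right column, line 1
    (0, 30, 45, 40),    # 3: left column, line 2
    (55, 35, 100, 45),  # 4: right column, line 2
]
print(xy_cut(boxes))  # [0, 1, 3, 2, 4]
```

In this sketch the left column is read in full before the right one, which is the kind of ordering a position embedding indexed by raw OCR output would miss.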

updated: Tue Mar 15 2022 14:51:16 GMT+0000 (UTC)

published: Mon Mar 14 2022 09:19:12 GMT+0000 (UTC)

arXiv
