Multi-Modal Association based Grouping for Form Structure Extraction

Milan Aggarwal; Mausoom Sarkar; Hiresh Gupta; Balaji Krishnamurthy

Document structure extraction has been widely researched for decades. Recent work in this direction is deep learning-based and mostly extracts structure with fully convolutional networks through semantic segmentation. In this work, we present a novel multi-modal approach for form structure extraction. Given simple elements such as textruns and widgets, we extract higher-order structures such as TextBlocks, Text Fields, Choice Fields, and Choice Groups, which are essential for information collection in forms. To achieve this, we identify the candidate elements closest to each low-level element (the reference) and obtain a local image patch around it. We process the textual and spatial representations of the candidates sequentially through a BiLSTM to obtain context-aware representations, and fuse them with image patch features obtained by passing the patch through a CNN. A sequential decoder then takes this fused feature vector and predicts the association type between the reference and each candidate. These predicted associations are used to determine larger structures through connected components analysis. Experimental results show the effectiveness of our approach, which achieves recall of 90.29%, 73.80%, 83.12%, and 52.72% for the above structures, respectively, significantly outperforming semantic segmentation baselines. Through ablations, we show the efficacy of our method compared with using individual modalities. We also introduce our new, richly human-annotated Forms Dataset.
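The two non-learned steps named in the abstract, picking the candidates closest to a reference and grouping elements through connected components, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the `(x0, y0, x1, y1)` box format, the use of box-center distance as the closeness measure, the function names, and the reduction of the predicted association types to a plain list of linked pairs.

```python
from math import hypot


def center(box):
    """Center of an (x0, y0, x1, y1) bounding box (assumed box format)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def nearest_candidates(elements, ref_id, k):
    """Return the ids of the k elements closest to the reference element.

    Closeness is measured between box centers, which is an assumption;
    the abstract only says candidates are the elements closest to the reference.
    """
    rx, ry = center(elements[ref_id])

    def distance(eid):
        cx, cy = center(elements[eid])
        return hypot(cx - rx, cy - ry)

    others = [eid for eid in elements if eid != ref_id]
    return sorted(others, key=distance)[:k]


def group_by_association(element_ids, associated_pairs):
    """Group elements into connected components of the association graph.

    `associated_pairs` stands in for the decoder output: each pair is a
    reference/candidate link predicted to belong to the same structure.
    """
    parent = {eid: eid for eid in element_ids}

    def find(x):
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in associated_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for eid in element_ids:
        groups.setdefault(find(eid), []).append(eid)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical toy form: three nearby elements and one far away.
elements = {
    "t1": (0, 0, 10, 2),
    "t2": (0, 3, 10, 5),
    "w1": (12, 0, 14, 2),
    "t3": (0, 50, 10, 52),
}
candidates = nearest_candidates(elements, "t1", k=2)
groups = group_by_association(list(elements), [("t1", "t2"), ("t2", "w1")])
```

In this toy case the links `t1`-`t2` and `t2`-`w1` merge into one group even though `t1` and `w1` were never linked directly, which is the property that lets pairwise predictions build up larger structures.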

updated: Fri Jul 09 2021 12:49:34 GMT+0000 (UTC)

published: Fri Jul 09 2021 12:49:34 GMT+0000 (UTC)

arXiv
