Improving Table Structure Recognition with Visual-Alignment Sequential Coordinate Modeling

Yongshuai Huang; Ning Lu; Dapeng Chen; Yibo Li; Zecheng Xie; Shenggao Zhu; Liangcai Gao; Wei Peng

Table structure recognition aims to extract the logical and physical structure of unstructured table images into a machine-readable format. The latest end-to-end image-to-text approaches predict the two structures simultaneously with two decoders, where the prediction of the physical structure (the bounding boxes of the cells) is based on the representation of the logical structure. However, previous methods struggle with imprecise bounding boxes because the logical representation lacks local visual information. To address this issue, we propose VAST, an end-to-end sequential modeling framework for table structure recognition. It contains a novel coordinate sequence decoder that is triggered by the representation of each non-empty cell from the logical structure decoder. In the coordinate sequence decoder, we model the bounding box coordinates as a language sequence, decoding the left, top, right, and bottom coordinates sequentially to leverage the inter-coordinate dependency. Furthermore, we propose an auxiliary visual-alignment loss that encourages the logical representation of the non-empty cells to contain more local visual details, which helps produce better cell bounding boxes. Extensive experiments demonstrate that the proposed method achieves state-of-the-art results in both logical and physical structure recognition. The ablation study also validates that the proposed coordinate sequence decoder and the visual-alignment loss are key to the success of the method.
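
The idea of treating a bounding box as a short token sequence can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how coordinates are discretized or scored, so the bin count `NUM_BINS`, the helper names, and the `score_fn` callback (a stand-in for a learned decoder step) are all hypothetical. The sketch only shows the decoding order (left, top, right, bottom) and how each step can condition on the coordinates already emitted.

```python
from typing import Callable, List, Sequence, Tuple

NUM_BINS = 608  # hypothetical vocabulary size; not taken from the paper


def box_to_tokens(box: Sequence[float], width: int, height: int,
                  num_bins: int = NUM_BINS) -> List[int]:
    """Quantize a (left, top, right, bottom) pixel box into four integer tokens."""
    extents = (width, height, width, height)
    tokens = []
    for value, extent in zip(box, extents):
        index = int(value * num_bins / extent)
        # Clamp so coordinates on or beyond the image border stay in range.
        tokens.append(min(max(index, 0), num_bins - 1))
    return tokens


def tokens_to_box(tokens: Sequence[int], width: int, height: int,
                  num_bins: int = NUM_BINS) -> Tuple[float, ...]:
    """Map four tokens back to pixel coordinates using bin centres."""
    extents = (width, height, width, height)
    return tuple((t + 0.5) / num_bins * e for t, e in zip(tokens, extents))


def decode_box(cell_repr: object,
               score_fn: Callable[[object, List[int]], Sequence[float]],
               num_bins: int = NUM_BINS) -> List[int]:
    """Greedily decode left, top, right, bottom, one token at a time.

    score_fn is a hypothetical stand-in for a learned decoder step: given the
    cell representation and the tokens emitted so far, it returns one score
    per bin. Passing the earlier tokens is what lets later coordinates depend
    on earlier ones.
    """
    tokens: List[int] = []
    for _ in range(4):
        scores = score_fn(cell_repr, tokens)
        tokens.append(max(range(num_bins), key=lambda i: scores[i]))
    return tokens
```

In a trained model, `score_fn` would be a decoder conditioned on the non-empty cell's representation; greedy selection is used here only to keep the sketch short.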

updated: Mon Mar 13 2023 09:34:08 GMT+0000 (UTC)

published: Mon Mar 13 2023 09:34:08 GMT+0000 (UTC)

arXiv
