VILA: Improving Structured Content Extraction from Scientific PDFs Using Visual Layout Groups

Zejiang Shen; Kyle Lo; Lucy Lu Wang; Bailey Kuehl; Daniel S. Weld; Doug Downey

Accurately extracting structured content from PDFs is a critical first step for NLP over scientific papers. Recent work has improved extraction accuracy by incorporating elementary layout information, e.g., each token's 2D position on the page, into language model pretraining. We introduce new methods that explicitly model VIsual LAyout (VILA) groups, i.e., text lines or text blocks, to further improve performance. In our I-VILA approach, we show that simply inserting special tokens denoting layout group boundaries into model inputs can lead to a 1.9% Macro F1 improvement in token classification. In our H-VILA approach, we show that hierarchical encoding of layout groups can reduce inference time by up to 47% with less than 0.8% Macro F1 loss. Unlike prior layout-aware approaches, our methods require only fine-tuning rather than expensive additional pretraining, which we show can reduce training cost by up to 95%. Experiments are conducted on a newly curated evaluation suite, S2-VLUE, that unifies existing automatically labeled datasets and includes a new dataset of manual annotations covering diverse papers from 19 scientific disciplines. Pre-trained weights, benchmark datasets, and source code are available at https://github.com/allenai/VILA.
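
The I-VILA idea described above, marking layout group boundaries with a special token in the model input, can be illustrated with a minimal sketch. This is not the code from the VILA repository: the function name `insert_layout_boundaries`, the token string `[BLK]`, and the choice to place the token between consecutive groups are assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumed name for the layout indicator token; the real vocabulary entry may differ.
BOUNDARY_TOKEN = "[BLK]"


def insert_layout_boundaries(
    groups: List[List[str]], boundary: str = BOUNDARY_TOKEN
) -> Tuple[List[str], List[int]]:
    """Flatten layout groups (text lines or blocks) into one token sequence.

    A boundary token is inserted between consecutive non-empty groups.
    Returns the flattened tokens and, for each token, the index of the group
    it came from (-1 for inserted boundary tokens), so that predictions on
    boundary tokens can be ignored during token classification.
    """
    tokens: List[str] = []
    group_ids: List[int] = []
    for idx, group in enumerate(groups):
        if not group:
            continue  # skip empty groups so no doubled boundary tokens appear
        if tokens:
            tokens.append(boundary)
            group_ids.append(-1)
        tokens.extend(group)
        group_ids.extend([idx] * len(group))
    return tokens, group_ids


if __name__ == "__main__":
    page = [["VILA", ":"], ["Abstract"], ["We", "introduce"]]
    flat, ids = insert_layout_boundaries(page)
    print(flat)
    print(ids)
```

In a fine-tuning setup, the flattened sequence would be passed to the tokenizer of a pretrained model, with the boundary token registered as an additional special token; that step is omitted here because it depends on the chosen model.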

updated: Wed Jan 05 2022 15:59:32 GMT+0000 (UTC)

published: Tue Jun 01 2021 17:59:00 GMT+0000 (UTC)

arXiv
