CoVA: Context-aware Visual Attention for Webpage Information Extraction

Anurendra Kumar; Keval Morabia; Jingjin Wang; Kevin Chen-Chuan Chang; Alexander Schwing

Webpage information extraction (WIE) is an important step in creating knowledge bases. Classical WIE methods leverage the Document Object Model (DOM) tree of a website for this purpose. However, using the DOM tree poses significant challenges because context and appearance are encoded in an abstract manner. To address this challenge, we propose to reformulate WIE as a context-aware Webpage Object Detection task. Specifically, we develop a Context-aware Visual Attention-based (CoVA) detection pipeline that combines appearance features with syntactical structure from the DOM tree. To study the approach, we collect a new large-scale dataset of e-commerce websites, for which we manually annotate every web element with one of four labels: product price, product title, product image, and background. On this dataset, we show that the proposed CoVA approach is a new challenging baseline that improves upon prior state-of-the-art methods.
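
The idea of letting a web element attend to its surrounding context can be illustrated with a toy example. The sketch below is not the CoVA implementation: the function names, the two-dimensional feature vectors, and the use of plain dot-product attention are all assumptions chosen for illustration. It only shows how a target element's feature vector might be fused with an attention-weighted summary of the features of neighbouring elements.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend_to_context(target, context):
    """Fuse a target element's features with its context (hypothetical helper).

    target  -- feature vector of the element being classified
    context -- feature vectors of neighbouring elements
    Returns (weights, fused), where fused is the target vector
    concatenated with the attention-weighted context summary.
    """
    if not context:
        # No neighbours: pad the summary with zeros.
        return [], list(target) + [0.0] * len(target)
    # Assumption: similarity is a plain dot product.
    weights = softmax([dot(target, c) for c in context])
    summary = [
        sum(w * c[i] for w, c in zip(weights, context))
        for i in range(len(target))
    ]
    return weights, list(target) + summary


# Toy features (made up): one candidate element and two neighbours.
candidate = [1.0, 0.0]
neighbours = [[1.0, 0.0], [0.0, 1.0]]
weights, fused = attend_to_context(candidate, neighbours)
```

In this toy setup the neighbour whose features resemble the candidate receives the larger attention weight, and the fused vector could then be passed to any classifier over the four labels named in the abstract.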

updated: Sun Oct 24 2021 00:21:46 GMT+0000 (UTC)

published: Sun Oct 24 2021 00:21:46 GMT+0000 (UTC)

arXiv
