SelfDoc: Self-Supervised Document Representation Learning

Peizhao Li; Jiuxiang Gu; Jason Kuen; Vlad I. Morariu; Handong Zhao; Rajiv Jain; Varun Manjunatha; Hongfu Liu

We propose SelfDoc, a task-agnostic pre-training framework for document image understanding. Because documents are multimodal and intended for sequential reading, our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document and models the contextualization between blocks of content. Unlike existing document pre-training models, our model is coarse-grained: it takes semantically meaningful components rather than individual words as input, thereby avoiding overly fine-grained inputs with excessive contextualization. Beyond that, we introduce cross-modal learning in the pre-training phase to fully leverage multimodal information from unlabeled documents. For downstream usage, we propose a novel modality-adaptive attention mechanism for multimodal feature fusion that adaptively emphasizes language and vision signals. Our framework benefits from self-supervised pre-training on documents, which requires no annotations thanks to a feature-masking training strategy. It achieves superior performance on multiple downstream tasks while using significantly fewer document images in the pre-training stage than previous works.
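
As a rough illustration of the feature-masking idea mentioned in the abstract, the following minimal Python sketch hides a random subset of block-level feature vectors and scores a reconstruction on the hidden blocks only. Everything here is an assumption made for illustration: the function names `mask_block_features` and `reconstruction_loss`, the zero-vector mask, the default `mask_ratio`, and the mean-squared-error objective are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Tuple


def mask_block_features(
    features: List[List[float]],
    mask_ratio: float = 0.15,  # assumed value, not from the paper
    seed: int = 0,
) -> Tuple[List[List[float]], List[int]]:
    """Zero out a random subset of block-level feature vectors.

    Returns a masked copy of the features and the sorted indices
    of the blocks that were masked. The input is left unchanged.
    """
    rng = random.Random(seed)
    n = len(features)
    # Mask at least one block whenever the document has any blocks.
    k = max(1, round(n * mask_ratio)) if n else 0
    masked_idx = sorted(rng.sample(range(n), k))
    masked = [list(vec) for vec in features]
    for i in masked_idx:
        masked[i] = [0.0] * len(masked[i])
    return masked, masked_idx


def reconstruction_loss(
    predicted: List[List[float]],
    target: List[List[float]],
    masked_idx: List[int],
) -> float:
    """Mean squared error computed over the masked blocks only (assumed objective)."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Four toy "document blocks", each with a 2-dimensional feature vector.
    blocks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    masked_blocks, idx = mask_block_features(blocks, mask_ratio=0.5, seed=1)
    print("masked indices:", idx)
    print("loss if the model outputs the masked input:",
          reconstruction_loss(masked_blocks, blocks, idx))
```

In an actual pre-training setup, a model would take the masked features plus the unmasked context and predict the hidden vectors. The sketch only shows the masking and scoring steps, which need no human annotations because the targets are the original features themselves.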

updated: Mon Jun 07 2021 04:19:49 GMT+0000 (UTC)

published: Mon Jun 07 2021 04:19:49 GMT+0000 (UTC)

arXiv
