Test-Time Adaptation for Visual Document Understanding

Sayna Ebrahimi; Sercan O. Arik; Tomas Pfister

For visual document understanding (VDU), self-supervised pretraining has been shown to produce transferable representations, yet effective adaptation of such representations to distribution shifts at test time remains unexplored. We propose DocTTA, a novel test-time adaptation method for documents that performs source-free domain adaptation using unlabeled target document data. DocTTA leverages cross-modality self-supervised learning via masked visual language modeling, as well as pseudo-labeling, to adapt models learned on a source domain to an unlabeled target domain at test time. We introduce new benchmarks using existing public datasets for various VDU tasks, including entity recognition, key-value extraction, and document visual question answering. On these tasks, DocTTA shows significant improvements over the source model's performance: up to 1.89% (F1 score), 3.43% (F1 score), and 17.68% (ANLS score), respectively. Our benchmark datasets are available at https://saynaebrahimi.github.io/DocTTA.html.
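The document VQA result above is reported in ANLS, which is commonly expanded as Average Normalized Levenshtein Similarity. As a minimal sketch of how such a score is usually computed, the snippet below assumes the conventional 0.5 threshold and case-insensitive, whitespace-stripped comparison. The function names `levenshtein` and `anls` are illustrative and are not taken from the DocTTA code, whose exact evaluation settings are not stated in the abstract.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def anls(
    predictions: Sequence[str],
    references: Sequence[List[str]],
    threshold: float = 0.5,  # assumed: the commonly used cutoff
) -> float:
    """Average, over questions, of the best similarity to any reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, references):
        pred = prediction.strip().lower()
        best = 0.0
        for answer in answers:
            ref = answer.strip().lower()
            longest = max(len(pred), len(ref))
            distance = levenshtein(pred, ref) / longest if longest else 0.0
            # Answers that are too far from the reference score zero.
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under these assumptions, a prediction that differs from a reference only in case scores 1.0, while a prediction with half or more of its characters wrong scores 0.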

updated: Wed Aug 23 2023 22:54:40 GMT+0000 (UTC)

published: Wed Jun 15 2022 01:57:12 GMT+0000 (UTC)

arXiv
