TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

Zhengyuan Yang; Yijuan Lu; Jianfeng Wang; Xi Yin; Dinei Florencio; Lijuan Wang; Cha Zhang; Lei Zhang; Jiebo Luo

In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks. These two tasks aim at reading and understanding scene text in images for question answering and image caption generation, respectively. In contrast to conventional vision-language pre-training, which fails to capture scene text and its relationship with the visual and text modalities, TAP explicitly incorporates scene text (generated by OCR engines) in pre-training. With three pre-training tasks, namely masked language modeling (MLM), image-text (contrastive) matching (ITM), and relative (spatial) position prediction (RPP), TAP effectively helps the model learn a better-aligned representation among the three modalities: text word, visual object, and scene text. Thanks to this aligned representation learning, even when pre-trained on the same downstream task dataset, TAP already boosts the absolute accuracy on the TextVQA dataset by +5.4% compared with a non-TAP baseline. To further improve performance, we build a large-scale dataset based on the Conceptual Captions dataset, named OCR-CC, which contains 1.4 million scene-text-related image-text pairs. Pre-trained on this OCR-CC dataset, our approach outperforms the state of the art by large margins on multiple tasks: +8.3% accuracy on TextVQA, +8.6% accuracy on ST-VQA, and +10.2 CIDEr score on TextCaps.
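
To make the relative (spatial) position prediction task more concrete, here is a minimal sketch of how a training label for one pair of regions might be derived from bounding boxes. It is an illustration under stated assumptions, not the authors' implementation: the eight-sector label set `DIRECTIONS`, the helper names, and the `(x1, y1, x2, y2)` box format are all hypothetical, and the abstract does not specify which spatial classes TAP actually uses.

```python
import math

# Hypothetical label set: eight direction sectors in image coordinates
# (y grows downward). The abstract does not say which classes TAP uses.
DIRECTIONS = [
    "right", "below-right", "below", "below-left",
    "left", "above-left", "above", "above-right",
]


def box_center(box):
    """Center of a box given as (x1, y1, x2, y2); this format is an assumption."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def relative_position_label(object_box, ocr_box):
    """Return the index in DIRECTIONS of where ocr_box lies relative to object_box.

    The angle between the two box centers is bucketed into 45-degree sectors.
    Boxes with identical centers fall into sector 0 in this simplified sketch.
    """
    ox, oy = box_center(object_box)
    tx, ty = box_center(ocr_box)
    angle = math.degrees(math.atan2(ty - oy, tx - ox)) % 360.0
    # Shift by half a sector so each sector is centered on its direction.
    return int(((angle + 22.5) % 360.0) // 45.0)


if __name__ == "__main__":
    sign_region = (0, 0, 10, 10)      # a visual object region
    ocr_token_box = (20, 0, 30, 10)   # a scene-text token to its right
    label = relative_position_label(sign_region, ocr_token_box)
    print(DIRECTIONS[label])
```

In a pre-training setup along these lines, a model would be asked to predict such a label from the fused features of the sampled object region and scene-text token, which encourages it to encode where scene text sits relative to visual objects.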

updated: Tue Dec 08 2020 18:55:21 GMT+0000 (UTC)

published: Tue Dec 08 2020 18:55:21 GMT+0000 (UTC)

arXiv
