PreSTU: Pre-Training for Scene-Text Understanding

Jihyung Kil; Soravit Changpinyo; Xi Chen; Hexiang Hu; Sebastian Goodman; Wei-Lun Chao; Radu Soricut

The ability to recognize and reason about text embedded in visual inputs is often lacking in vision-and-language (V&L) models, perhaps because V&L pre-training methods have often failed to include such an ability in their training objective. In this paper, we propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU). PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and connect it to the rest of the image content. We implement PreSTU using a simple transformer-based encoder-decoder architecture, combined with large-scale image-text datasets with scene text obtained from an off-the-shelf OCR system. We empirically demonstrate the effectiveness of this pre-training approach on eight visual question answering and four image captioning benchmarks.

updated: Sun Aug 20 2023 00:25:44 GMT+0000 (UTC)

published: Mon Sep 12 2022 18:29:55 GMT+0000 (UTC)

arXiv
