Towards Models that Can See and Read

Roy Ganz; Oren Nuriel; Aviad Aberdam; Yair Kittenplon; Shai Mazor; Ron Litman

Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image. Despite their obvious resemblance, the two are treated independently and, as we show, yield task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth analysis of this phenomenon and propose UniTNT, a Unified Text-Non-Text approach, which grants existing multimodal architectures scene-text understanding capabilities. Specifically, we treat scene-text information as an additional modality, fusing it with any pretrained encoder-decoder-based architecture via designated modules. Thorough experiments reveal that UniTNT leads to the first single model that successfully handles both task types. Moreover, we show that scene-text understanding capabilities can boost vision-language models' performance on general VQA and CAP by up to 2.69% and 0.6 CIDEr, respectively.

updated: Tue Mar 21 2023 11:40:47 GMT+0000 (UTC)

published: Wed Jan 18 2023 09:36:41 GMT+0000 (UTC)

arXiv
