External Knowledge enabled Text Visual Question Answering

Arka Ujjal Dey; Ernest Valveny; Gaurav Harit

The open-ended question answering task of Text-VQA requires reading and reasoning about the local, often previously unseen, scene-text content of an image to generate answers. In this work, we propose the generalized use of external knowledge to augment the understanding of this scene text. We design a framework to extract, validate, and reason with knowledge using a standard multimodal transformer for vision-language understanding tasks. Through empirical evidence and qualitative results, we demonstrate how external knowledge can highlight instance-only cues and thus help deal with training data bias, improve answer entity type correctness, and detect multiword named entities. We obtain results comparable to the state of the art on two publicly available datasets, under the constraints of similar upstream OCR systems and training data.

updated: Wed Oct 20 2021 09:50:10 GMT+0000 (UTC)

published: Sun Aug 22 2021 13:21:58 GMT+0000 (UTC)

arXiv
