Grounding Answers for Visual Questions Asked by Visually Impaired People

Chongyan Chen; Samreen Anjum; Danna Gurari

Visual question answering is the task of answering questions about images. We introduce the VizWiz-VQA-Grounding dataset, the first dataset that visually grounds answers to visual questions asked by people with visual impairments. We analyze our dataset and compare it with five VQA-Grounding datasets to show where it is similar and where it differs. We then evaluate state-of-the-art (SOTA) VQA and VQA-Grounding models and show that they often fail to identify the correct visual evidence where the answer is located. These models regularly struggle when the visual evidence occupies a small fraction of the image, when images are of higher quality, and when visual questions require text-recognition skills. The dataset, evaluation server, and leaderboard can all be found at the following link: https://vizwiz.org/tasks-and-datasets/answer-grounding-for-vqa/.

updated: Fri Apr 08 2022 21:57:30 GMT+0000 (UTC)

published: Fri Feb 04 2022 06:47:16 GMT+0000 (UTC)

arXiv
