Visual Question Answering Using Semantic Information from Image Descriptions

Tasmia Tasrin; Md Sultan Al Nahian; Brent Harrison

In this work, we propose a deep neural architecture for visual question answering (VQA) that produces open-ended answers. Its attention mechanism draws on three inputs: region-based image features, the natural language question, and semantic knowledge extracted from regions of the image. Combining region-based visual features with region-based textual information about the image helps the model answer questions more accurately, potentially with less training data. We evaluate the proposed architecture on a VQA task against a strong baseline and show that our method achieves excellent results on this task.
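
As a rough illustration of the idea in the abstract, the sketch below shows question-guided attention over image regions, where each region is represented by a visual feature vector joined with an embedding of that region's text description. It is not the authors' architecture: the scaled dot-product scoring, the concatenation-based fusion, and every name (`attend`, `question_vec`, `region_feats`, `region_text_feats`) are assumptions chosen for brevity, and the question vector is assumed to be already projected to the fused dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(question_vec, region_feats, region_text_feats):
    """Hypothetical question-guided attention over image regions.

    Each region's visual feature is concatenated with the embedding of its
    text description; the question vector scores every fused region vector,
    and the softmax weights produce one attended context vector.
    """
    # Fuse visual and textual information per region (list concatenation).
    fused = [v + t for v, t in zip(region_feats, region_text_feats)]
    dim = len(question_vec)
    # Scaled dot-product score between the question and each fused region.
    scores = [dot(question_vec, f) / math.sqrt(dim) for f in fused]
    weights = softmax(scores)
    # Weighted sum of the fused region vectors.
    context = [sum(w * f[i] for w, f in zip(weights, fused)) for i in range(dim)]
    return weights, context


# Toy inputs: two regions, 2-d visual features and 2-d text embeddings.
question_vec = [1.0, 0.0, 1.0, 0.0]
region_feats = [[1.0, 0.0], [0.0, 1.0]]
region_text_feats = [[1.0, 0.0], [0.0, 1.0]]

weights, context = attend(question_vec, region_feats, region_text_feats)
print(weights)
print(context)
```

In this toy input the first region matches the question in both its visual and textual parts, so it receives the larger attention weight. In a full model the context vector would feed an answer decoder, which this sketch omits.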

updated: Sat Apr 03 2021 18:09:22 GMT+0000 (UTC)

published: Thu Apr 23 2020 04:35:04 GMT+0000 (UTC)

arXiv
