Point and Ask: Incorporating Pointing into Visual Question Answering

Arjun Mani; Will Hinthorn; Nobline Yoo; Olga Russakovsky

Visual Question Answering (VQA) has become one of the key benchmarks of visual recognition progress. Multiple VQA extensions have been explored to better simulate real-world settings: different question formulations, changing training and test distributions, conversational consistency in dialogues, and explanation-based answering. In this work, we further expand this space by considering visual questions that include a spatial point of reference. Pointing is a nearly universal gesture among humans, and real-world VQA is likely to involve a gesture towards the target region. Concretely, we (1) introduce and motivate point-input questions as an extension of VQA, (2) define three novel classes of questions within this space, and (3) for each class, introduce both a benchmark dataset and a series of baseline models to handle its unique challenges. There are two key distinctions from prior work. First, we explicitly design the benchmarks to require the point input, i.e., we ensure that the visual question cannot be answered accurately without the spatial reference. Second, we explicitly explore the more realistic point spatial input rather than the standard but unnatural bounding box input. Through our exploration we uncover and address several visual recognition challenges, including the ability to infer human intent, reason both locally and globally about the image, and effectively combine visual, language and spatial inputs. Code is available at: https://github.com/princetonvisualai/pointingqa .

updated: Wed Jun 16 2021 16:54:24 GMT+0000 (UTC)

published: Fri Nov 27 2020 11:43:45 GMT+0000 (UTC)

arXiv
