A Picture May Be Worth a Hundred Words for Visual Question Answering

Yusuke Hirota; Noa Garcia; Mayu Otani; Chenhui Chu; Yuta Nakashima; Ittetsu Taniguchi; Takao Onoye

How far can we go with textual representations for understanding pictures? In image understanding, it is essential to use concise but detailed image representations. Deep visual features extracted by vision models, such as Faster R-CNN, are widely used in multiple tasks, especially in visual question answering (VQA). However, conventional deep visual features may struggle to convey all the details in an image as we humans do. Meanwhile, with the recent progress of language models, descriptive text may offer an alternative solution to this problem. This paper delves into the effectiveness of textual representations for image understanding in the specific context of VQA. We propose to take description-question pairs as input, instead of deep visual features, and feed them into a language-only Transformer model, simplifying the process and reducing the computational cost. We also experiment with data augmentation techniques to increase the diversity of the training set and avoid learning statistical bias. Extensive evaluations show that textual representations require only about a hundred words to compete with deep visual features on both VQA 2.0 and VQA-CP v2.
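As a rough illustration of the description-question input mentioned in the abstract, the sketch below joins a question and an image description into a single text sequence for a language-only model. The function name `build_text_input`, the BERT-style `[CLS]`/`[SEP]` markers, the question-first ordering, and the truncation by whitespace-separated words are all assumptions made for this example; the abstract does not specify any of them.

```python
def build_text_input(description: str, question: str,
                     max_description_words: int = 100) -> str:
    """Join a question and an image description into one text sequence.

    Hypothetical helper: the separator tokens and the ordering are
    illustrative assumptions, not the paper's exact input format.
    """
    # Keep at most `max_description_words` words of the description,
    # echoing the "about a hundred words" budget from the abstract.
    words = description.split()[:max_description_words]
    return f"[CLS] {question.strip()} [SEP] {' '.join(words)} [SEP]"


if __name__ == "__main__":
    # Made-up description and question, for illustration only.
    text = build_text_input(
        description="A man is riding a brown horse along a sandy beach.",
        question="What is the man riding?",
    )
    print(text)
```

In practice the resulting string (or the two segments separately) would be passed to the tokenizer of whichever Transformer model is used, which normally inserts its own special tokens.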

updated: Fri Jun 25 2021 06:13:14 GMT+0000 (UTC)

published: Fri Jun 25 2021 06:13:14 GMT+0000 (UTC)

arXiv
