Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering

Jihyung Kil; Cheng Zhang; Dong Xuan; Wei-Lun Chao

Visual question answering (VQA) is challenging not only because the model has to handle multi-modal information, but also because it is just so hard to collect sufficient training examples -- there are too many questions one can ask about an image. As a result, a VQA model trained solely on human-annotated examples could easily over-fit specific question styles or image contents that are being asked, leaving the model largely ignorant about the sheer diversity of questions. Existing methods address this issue primarily by introducing an auxiliary task such as visual grounding, cycle consistency, or debiasing. In this paper, we take a drastically different approach. We found that many of the "unknowns" to the learned VQA model are indeed "known" in the dataset implicitly. For instance, questions asking about the same object in different images are likely paraphrases; the number of detected or annotated objects in an image already provides the answer to the "how many" question, even if the question has not been annotated for that image. Building upon these insights, we present a simple data augmentation pipeline SimpleAug to turn this "known" knowledge into training examples for VQA. We show that these augmented examples can notably improve the learned VQA models' performance, not only on the VQA-CP dataset with language prior shifts but also on the VQA v2 dataset without such shifts. Our method further opens up the door to leverage weakly-labeled or unlabeled images in a principled way to enhance VQA models. Our code and data are publicly available at https://github.com/heendung/simpleAUG.
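As a rough illustration of the "how many" observation in the abstract, the sketch below turns per-image object annotations into counting question-answer pairs. It is not the authors' SimpleAug implementation (see their repository for that). The function name `count_questions`, the input layout (image id mapped to a list of object labels), and the question template are all assumptions made for this example.

```python
from collections import Counter


def count_questions(annotations):
    """Build "how many" QA pairs from object labels (hypothetical helper).

    annotations: dict mapping an image id to the list of object labels
    detected or annotated in that image (assumed layout, not from the paper).
    """
    examples = []
    for image_id, labels in annotations.items():
        # Count how often each object label occurs in this image.
        for label, n in sorted(Counter(labels).items()):
            examples.append({
                "image_id": image_id,
                # Assumed template; real VQA questions are phrased more naturally.
                "question": f"How many objects labeled '{label}' are in the image?",
                "answer": str(n),
            })
    return examples


if __name__ == "__main__":
    toy = {"img1": ["dog", "dog", "cat"], "img2": []}
    for example in count_questions(toy):
        print(example)
```

An image with no annotated objects yields no examples in this sketch. Whether such images should produce "0" answers, and how noisy detector output should be filtered, are design choices the abstract does not specify.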

updated: Mon Sep 13 2021 16:56:43 GMT+0000 (UTC)

published: Mon Sep 13 2021 16:56:43 GMT+0000 (UTC)

arXiv
