Improving Selective Visual Question Answering by Learning from Your Peers

Corentin Dancette; Spencer Whitehead; Rishabh Maheshwary; Ramakrishna Vedantam; Stefan Scherer; Xinlei Chen; Matthieu Cord; Marcus Rohrbach

Despite advances in Visual Question Answering (VQA), the ability of models to assess their own correctness remains underexplored. Recent work has shown that VQA models, out of the box, can have difficulty abstaining from answering when they are wrong. The option to abstain, also called Selective Prediction, is highly relevant when deploying systems to users who must trust the system's output (e.g., VQA assistants for users with visual impairments). In such scenarios, abstention can be especially important because users may provide out-of-distribution (OOD) or adversarial inputs that make incorrect answers more likely. In this work, we explore Selective VQA in both in-distribution (ID) and OOD scenarios, where models are presented with mixtures of ID and OOD data. The goal is to maximize the number of questions answered while minimizing the risk of error on those questions. We propose a simple yet effective Learning from Your Peers (LYP) approach for training multimodal selection functions that make abstention decisions. Our approach uses predictions from models trained on distinct subsets of the training data as targets for optimizing a Selective VQA model. It does not require additional manual labels or held-out data, and it provides a signal for identifying examples that are easy or difficult to generalize to. In our extensive evaluations, we show that this benefits a number of models across different architectures and scales. Overall, for ID, we reach 32.92% on the selective prediction metric coverage at 1% risk of error (C@1%), which doubles the previous best coverage of 15.79% on this task. For mixed ID/OOD, using models' softmax confidences for abstention decisions performs very poorly, answering fewer than 5% of questions at 1% risk of error even when faced with only 10% OOD examples, but a learned selection function with LYP can increase that to 25.38% C@1%.
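To make the C@1% metric concrete, here is a minimal Python sketch of coverage at a fixed risk level. It is an illustration written for this page, not the authors' evaluation code. The function name `coverage_at_risk` and its arguments are hypothetical, and the sketch assumes that each example carries a confidence score from the selection function and a per-example accuracy in [0, 1], with risk taken as the mean error over the answered questions. Tied confidences are answered in an arbitrary order here, which a threshold-based evaluation would handle differently.

```python
from typing import Sequence


def coverage_at_risk(
    confidences: Sequence[float],
    accuracies: Sequence[float],
    max_risk: float,
) -> float:
    """Largest fraction of questions answerable with mean error <= max_risk.

    Assumption: the model answers its most confident questions first and
    abstains on all the others.
    """
    if len(confidences) != len(accuracies):
        raise ValueError("confidences and accuracies must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Visit examples from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)

    best_coverage = 0.0
    total_error = 0.0
    for answered, i in enumerate(order, start=1):
        total_error += 1.0 - accuracies[i]
        # Risk is the mean error over the questions answered so far.
        if total_error / answered <= max_risk:
            best_coverage = answered / n
    return best_coverage


if __name__ == "__main__":
    conf = [0.9, 0.8, 0.7, 0.6]
    acc = [1.0, 1.0, 0.0, 1.0]
    print(coverage_at_risk(conf, acc, max_risk=0.01))
```

Under these assumptions, a C@1% of 32.92% would mean that about a third of the questions can be answered before the mean error on the answered set exceeds 1%.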

updated: Wed Jun 14 2023 21:22:01 GMT+0000 (UTC)

published: Wed Jun 14 2023 21:22:01 GMT+0000 (UTC)

arXiv
