Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly

Spencer Whitehead; Suzanne Petryk; Vedaad Shakib; Joseph Gonzalez; Trevor Darrell; Anna Rohrbach; Marcus Rohrbach

Machine learning has advanced dramatically, narrowing the accuracy gap to humans in multimodal tasks like visual question answering (VQA). However, while humans can say "I don't know" when they are uncertain (i.e., abstain from answering a question), such ability has been largely neglected in multimodal research, despite the importance of this problem to the usage of VQA in real settings. In this work, we promote a problem formulation for reliable VQA, where we prefer abstention over providing an incorrect answer. We first enable abstention capabilities for several VQA models, and analyze both their coverage, the portion of questions answered, and risk, the error on that portion. For that, we explore several abstention approaches. We find that although the best performing models achieve over 70% accuracy on the VQA v2 dataset, introducing the option to abstain by directly using a model's softmax scores limits them to answering less than 7.5% of the questions to achieve a low risk of error (i.e., 1%). This motivates us to utilize a multimodal selection function to directly estimate the correctness of the predicted answers, which we show can increase the coverage by, for example, 2.3x from 6.8% to 15.6% at 1% risk. While it is important to analyze both coverage and risk, these metrics have a trade-off which makes comparing VQA models challenging. To address this, we also propose an Effective Reliability metric for VQA that places a larger cost on incorrect answers compared to abstentions. This new problem formulation, metric, and analysis for VQA provide the groundwork for building effective and reliable VQA models that have the self-awareness to abstain if and only if they don't know the answer.
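
The quantities named in the abstract can be illustrated with a minimal sketch. It assumes each prediction comes with a scalar confidence (for example a maximum softmax score or a selection-function output) and a binary correctness label. The function names, the binary simplification of VQA accuracy, and the exact form of the cost-penalized score are assumptions made for illustration, not the paper's definitions.

```python
from typing import List, Tuple


def coverage_and_risk(confidences: List[float], correct: List[bool],
                      threshold: float) -> Tuple[float, float]:
    """Answer only when confidence >= threshold; abstain otherwise.

    Coverage is the fraction of questions answered; risk is the error
    rate on the answered questions (0.0 if nothing is answered).
    """
    answered = [c for s, c in zip(confidences, correct) if s >= threshold]
    coverage = len(answered) / len(confidences)
    risk = 1.0 - sum(answered) / len(answered) if answered else 0.0
    return coverage, risk


def max_coverage_at_risk(confidences: List[float], correct: List[bool],
                         max_risk: float) -> float:
    """Largest coverage whose risk stays at or below max_risk.

    Assumes confidences are distinct, so every prefix of the ranking
    corresponds to some threshold.
    """
    ranked = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    best, errors = 0.0, 0
    for k, (_, is_correct) in enumerate(ranked, start=1):
        errors += 0 if is_correct else 1
        if errors / k <= max_risk:
            best = k / len(ranked)
    return best


def cost_penalized_score(confidences: List[float], correct: List[bool],
                         threshold: float, cost: float = 1.0) -> float:
    """Hypothetical score: +1 for a correct answer, -cost for a wrong
    answer, 0 for an abstention, averaged over all questions."""
    total = 0.0
    for s, is_correct in zip(confidences, correct):
        if s < threshold:
            continue  # abstained: contributes nothing
        total += 1.0 if is_correct else -cost
    return total / len(confidences)


# Toy data (made up for illustration only).
confidences = [0.95, 0.90, 0.60, 0.40]
correct = [True, False, True, False]
print(coverage_and_risk(confidences, correct, threshold=0.5))
print(max_coverage_at_risk(confidences, correct, max_risk=0.0))
print(cost_penalized_score(confidences, correct, threshold=0.5, cost=10.0))
```

Raising the threshold lowers risk at the expense of coverage, which is the trade-off the abstract describes. A score of the third kind folds both into one number, with a larger `cost` penalizing wrong answers more heavily than abstentions.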

updated: Thu Oct 20 2022 17:36:51 GMT+0000 (UTC)

published: Thu Apr 28 2022 16:51:27 GMT+0000 (UTC)

arXiv
