Greedy Gradient Ensemble for Robust Visual Question Answering

Xinzhe Han; Shuhui Wang; Chi Su; Qingming Huang; Qi Tian

Language bias is a critical issue in Visual Question Answering (VQA), where models often exploit dataset biases to reach their final decision without considering the image information. As a result, they suffer from a performance drop on out-of-distribution data and provide inadequate visual explanations. Based on an experimental analysis of existing robust VQA methods, we stress that language bias in VQA comes from two aspects, i.e., distribution bias and shortcut bias. We further propose a new de-biasing framework, Greedy Gradient Ensemble (GGE), which combines multiple biased models for unbiased base model learning. With the greedy strategy, GGE forces the biased models to over-fit the biased data distribution first, which makes the base model pay more attention to examples that are hard for the biased models to solve. The experiments demonstrate that our method makes better use of visual information and achieves state-of-the-art performance on the diagnostic dataset VQA-CP without using extra annotations.
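
The greedy strategy described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a binary cross-entropy loss, a single biased model, and a base model trained on the clipped negative gradient of that loss with respect to the biased model's logits. The function names (`sigmoid`, `greedy_pseudo_labels`) and the toy numbers are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def greedy_pseudo_labels(labels, biased_logits):
    """Targets for the base model, given what the biased model already fits.

    For binary cross-entropy on logits h, the gradient w.r.t. h is
    sigmoid(h) - y, so the negative gradient is y - sigmoid(h).
    Clipping at 0 is an assumption made here so the result stays a
    valid soft label in [0, 1].
    """
    return [max(0.0, y - sigmoid(h)) for y, h in zip(labels, biased_logits)]


# Toy data (hypothetical): two examples whose correct answer has label 1,
# plus one wrong answer candidate with label 0.
labels = [1.0, 1.0, 0.0]
# The biased model is confident and right on the first example,
# confident and wrong on the second, and over-predicts the third.
biased_logits = [4.0, -4.0, 2.0]

pseudo = greedy_pseudo_labels(labels, biased_logits)
for y, h, t in zip(labels, biased_logits, pseudo):
    print(f"label={y:.0f} biased_logit={h:+.1f} base_target={t:.3f}")
```

Under these assumptions, an example the biased model already answers correctly leaves almost no target for the base model, while an example the biased model gets wrong keeps a target close to its original label. That is one way to read the abstract's statement that the base model pays more attention to examples that are hard for the biased models.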

updated: Mon Aug 09 2021 13:36:51 GMT+0000 (UTC)

published: Tue Jul 27 2021 08:02:49 GMT+0000 (UTC)

arXiv
