Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking

Dirk Väth; Pascal Tilli; Ngoc Thang Vu


On the way towards general Visual Question Answering (VQA) systems that can answer arbitrary questions, the need arises for evaluation beyond single-metric leaderboards for specific datasets. To this end, we propose a browser-based benchmarking tool for researchers and challenge organizers, with an API for easy integration of new models and datasets to keep up with the fast-changing landscape of VQA. Our tool helps test the generalization capabilities of models across multiple datasets, evaluating not just accuracy but also performance in more realistic scenarios, such as robustness to input noise. Additionally, we include metrics that measure biases and uncertainty to further explain model behavior. Interactive filtering facilitates the discovery of problematic behavior, down to the data sample level. As a proof of concept, we perform a case study on four models. We find that state-of-the-art VQA models are optimized for specific tasks or datasets but fail to generalize even to other in-domain test sets; for example, they cannot recognize text in images. Our metrics allow us to quantify which image and question embeddings provide the most robustness to a model. All code is publicly available.
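To make the idea of evaluating robustness to input noise concrete, the following is a minimal sketch in Python. It is not the tool's API, and the abstract does not specify how the tool computes its metrics: the names `add_gaussian_noise`, `accuracy`, `noise_robustness`, and `toy_model`, the flat-list image representation, and the choice of Gaussian pixel noise are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical types for illustration only; not the benchmarking tool's API.
Image = List[float]              # toy stand-in: pixel intensities in [0, 1]
Sample = Tuple[Image, str, str]  # (image, question, ground-truth answer)
Model = Callable[[Image, str], str]


def add_gaussian_noise(image: Image, std: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise to each pixel and clip to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, std))) for p in image]


def accuracy(model: Model, samples: Sequence[Sample]) -> float:
    """Fraction of samples for which the model's answer matches exactly."""
    if not samples:
        raise ValueError("samples must not be empty")
    correct = sum(model(img, q) == ans for img, q, ans in samples)
    return correct / len(samples)


def noise_robustness(
    model: Model, samples: Sequence[Sample], std: float, seed: int = 0
) -> Tuple[float, float, float]:
    """Return (clean accuracy, noisy accuracy, accuracy drop under noise)."""
    rng = random.Random(seed)  # seeded so the comparison is reproducible
    noisy = [(add_gaussian_noise(img, std, rng), q, ans) for img, q, ans in samples]
    clean_acc = accuracy(model, samples)
    noisy_acc = accuracy(model, noisy)
    return clean_acc, noisy_acc, clean_acc - noisy_acc


def toy_model(image: Image, question: str) -> str:
    """Assumed placeholder model: ignores the question, thresholds brightness."""
    return "bright" if sum(image) / len(image) > 0.5 else "dark"


if __name__ == "__main__":
    data = [
        ([0.9, 0.8, 0.95, 0.85], "Is the image bright or dark?", "bright"),
        ([0.1, 0.2, 0.05, 0.15], "Is the image bright or dark?", "dark"),
        ([0.55, 0.5, 0.6, 0.45], "Is the image bright or dark?", "bright"),
    ]
    for std in (0.0, 0.1, 0.5):
        clean, noisy, drop = noise_robustness(toy_model, data, std)
        print(f"std={std}: clean={clean:.2f} noisy={noisy:.2f} drop={drop:.2f}")
```

Reporting the accuracy drop alongside clean accuracy is one simple way to compare models that score similarly on a leaderboard but degrade differently under perturbed inputs; the same pattern could be applied to question noise or to a different test set.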

updated: Mon Oct 11 2021 11:08:35 GMT+0000 (UTC)

published: Mon Oct 11 2021 11:08:35 GMT+0000 (UTC)

arXiv
