Which visual questions are difficult to answer? Analysis with Entropy of Answer Distributions

Kento Terao; Toru Tamaki; Bisser Raytchev; Kazufumi Kaneda; Shun'ichi Satoh

We propose a novel approach to identifying the difficulty of visual questions in Visual Question Answering (VQA) without direct supervision or difficulty annotations. Prior work has considered the diversity of ground-truth answers given by human annotators. In contrast, we analyze the difficulty of visual questions based on the behavior of multiple different VQA models. We propose to cluster the entropy values of the predicted answer distributions obtained by three different models: a baseline method that takes both images and questions as input, and two variants that take only images and only questions as input, respectively. We use simple k-means to cluster the visual questions of the VQA v2 validation set. We then use state-of-the-art methods to determine the accuracy and the entropy of the answer distributions for each cluster. A benefit of the proposed method is that no difficulty annotation is required, because the accuracy of each cluster reflects the difficulty of the visual questions that belong to it. Our approach can identify clusters of difficult visual questions that state-of-the-art methods do not answer correctly. Detailed analysis on the VQA v2 dataset reveals that 1) all methods perform poorly on the most difficult cluster (about 10% accuracy), 2) as cluster difficulty increases, the answers predicted by the different methods begin to differ, and 3) cluster entropy is highly correlated with cluster accuracy. We show that our approach can assess the difficulty of visual questions for which no ground truth is available (i.e., the test set of VQA v2) by assigning them to one of the clusters. We expect that this will stimulate new research directions and the development of new algorithms. Clustering results are available online at https://github.com/tttamaki/vqd .
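
The clustering step described above can be illustrated with a small sketch. This is not the authors' implementation: it only assumes that each visual question is represented by three entropy values (one per model: image and question, image only, question only) and that these vectors are grouped with plain k-means. The toy answer distributions, the use of natural-log entropy, the choice of k = 2, and all function names are hypothetical and chosen for illustration.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_features(questions):
    """Map each question to one entropy value per model (assumed: 3 models)."""
    return [[entropy(dist) for dist in model_dists] for model_dists in questions]


def assign(points, centers):
    """Index of the nearest center (Euclidean distance) for every point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: alternate nearest-center assignment and mean update."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = assign(points, centers)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centers), centers


# Toy data (made up): each question has three predicted answer distributions,
# one per model, over four candidate answers.
peaked_a = [0.97, 0.01, 0.01, 0.01]
peaked_b = [0.90, 0.05, 0.03, 0.02]
flat_a = [0.25, 0.25, 0.25, 0.25]
flat_b = [0.30, 0.30, 0.20, 0.20]
questions = [
    [peaked_a, peaked_a, peaked_a],  # all models confident
    [peaked_b, peaked_b, peaked_b],  # all models confident
    [flat_a, flat_a, flat_a],        # all models uncertain
    [flat_b, flat_b, flat_b],        # all models uncertain
]

features = entropy_features(questions)
labels, centers = kmeans(features, k=2)
print(labels)
```

On this toy input the two low-entropy questions end up in one cluster and the two high-entropy questions in the other. Which cluster counts as "difficult" would then be judged, as in the abstract, by measuring the accuracy of separate state-of-the-art models on each cluster.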

updated: Mon Dec 07 2020 22:40:58 GMT+0000 (UTC)

published: Sun Apr 12 2020 12:06:03 GMT+0000 (UTC)

arXiv
