Rethinking Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization

Aishwarya Agrawal; Ivana Kajić; Emanuele Bugliarello; Elnaz Davoodi; Anita Gergely; Phil Blunsom; Aida Nematzadeh

Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks such as image captioning and visual question answering (VQA). The quality of such models is commonly assessed by measuring their performance on unseen data that typically comes from the same distribution as the training data. However, we observe that these models exhibit poor out-of-distribution (OOD) generalization on the task of VQA. To better understand the underlying causes of this poor generalization, we comprehensively investigate the performance of two pretrained V&L models under different settings (i.e., classification and open-ended text generation) by conducting cross-dataset evaluations. We find that these models tend to learn to solve the benchmark rather than the high-level skills required by the VQA task. We also argue that in most cases generative models are less susceptible to shifts in data distribution, while frequently performing better on our tested benchmarks. Moreover, we find that multimodal pretraining improves OOD performance in most settings. Finally, we revisit assumptions underlying the use of automatic VQA evaluation metrics, and empirically show that their stringent nature repeatedly penalizes models for correct responses.

updated: Tue May 24 2022 16:44:45 GMT+0000 (UTC)

published: Tue May 24 2022 16:44:45 GMT+0000 (UTC)

arXiv
