LingoQA: Visual Question Answering for Autonomous Driving

Ana-Maria Marcu; Long Chen; Jan Hünermann; Alice Karnsund; Benoit Hanotte; Prajwal Chidananda; Saurabh Nair; Vijay Badrinarayanan; Alex Kendall; Jamie Shotton; Elahe Arani; Oleg Sinavski

We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. The dataset contains 28K unique short video scenarios and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 59.6% of the questions compared to 96.6% for humans. For evaluation, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient with human evaluations, surpassing existing techniques such as METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark as an evaluation platform for vision-language models in autonomous driving.
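To make the Spearman correlation figure concrete, the following is a minimal sketch of how a rank correlation between an automatic metric and human ratings can be computed. It is not the authors' evaluation code: the function names `average_ranks` and `spearman` and all score values are hypothetical, chosen only for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five systems (illustrative values, not from the paper).
metric_scores = [0.91, 0.40, 0.75, 0.62, 0.30]
human_scores = [0.95, 0.35, 0.60, 0.70, 0.20]

rho = spearman(metric_scores, human_scores)
print(f"Spearman correlation: {rho:.2f}")
```

A coefficient close to 1 means the metric orders the systems almost exactly as human raters do, which is the property a learned judge is meant to have.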

updated: Thu Sep 26 2024 15:30:00 GMT+0000 (UTC)

published: Thu Dec 21 2023 18:40:34 GMT+0000 (UTC)

arXiv
