NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario

Tianwen Qian; Jingjing Chen; Linhai Zhuo; Yang Jiao; Yu-Gang Jiang

We introduce a novel visual question answering (VQA) task in the context of autonomous driving, which aims to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in autonomous driving scenarios presents more challenges. First, the raw visual data are multi-modal, comprising images captured by cameras and point clouds captured by LiDAR. Second, the data are multi-frame because of continuous, real-time acquisition. Third, outdoor scenes contain both moving foreground objects and static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and design question templates manually. The question-answer pairs are then generated programmatically from these templates. Comprehensive statistics show that NuScenes-QA is a balanced, large-scale benchmark with diverse question formats. Building on it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. The code and dataset are available at https://github.com/qiantianwen/NuScenes-QA.
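The template-based generation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `SceneObject` fields, the two templates, and the helper names below are hypothetical, chosen only to show how question-answer pairs might be produced programmatically from annotation-derived scene graph nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """One node of a toy scene graph (hypothetical schema)."""
    category: str  # e.g. "car", "pedestrian"
    status: str    # e.g. "moving", "parked"


def build_scene_graph(annotations):
    """Convert annotation dicts into scene graph nodes (assumed input format)."""
    return [SceneObject(a["category"], a["status"]) for a in annotations]


def exist_question(graph, category):
    """Fill an existence template and derive the answer from the graph."""
    question = f"Are there any {category}s?"
    answer = "yes" if any(o.category == category for o in graph) else "no"
    return question, answer


def count_question(graph, category, status):
    """Fill a counting template and derive the answer from the graph."""
    question = f"How many {status} {category}s are there?"
    count = sum(1 for o in graph if o.category == category and o.status == status)
    return question, str(count)


def generate_qa(graph, categories, statuses):
    """Instantiate every template for every category/status combination."""
    pairs = []
    for category in categories:
        pairs.append(exist_question(graph, category))
        for status in statuses:
            pairs.append(count_question(graph, category, status))
    return pairs


# Toy annotations standing in for 3D detection labels (illustrative only).
annotations = [
    {"category": "car", "status": "moving"},
    {"category": "car", "status": "parked"},
    {"category": "pedestrian", "status": "moving"},
]
graph = build_scene_graph(annotations)
qa_pairs = generate_qa(graph, ["car", "truck"], ["moving", "parked"])
for question, answer in qa_pairs:
    print(question, "->", answer)
```

Because every answer is computed from the same graph that fills the template, the pairs are consistent with the annotations by construction. A real pipeline would presumably also need spatial relations between objects and some balancing of the answer distribution, neither of which is modeled here.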

updated: Tue Feb 20 2024 05:04:58 GMT+0000 (UTC)

published: Wed May 24 2023 07:40:50 GMT+0000 (UTC)

arXiv
