We introduce a novel visual question answering (VQA) task in the context of autonomous driving, which aims to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in the autonomous driving scenario presents additional challenges. First, the raw visual data are multi-modal, comprising images and point clouds captured by cameras and LiDAR, respectively. Second, the data are multi-frame due to continuous, real-time acquisition. Third, outdoor scenes contain both moving foreground objects and a static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and manually design question templates; question-answer pairs are then generated programmatically from these templates. Comprehensive statistics show that NuScenes-QA is a balanced, large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Code and dataset are available at https://github.com/qiantianwen/NuScenes-QA.