Understanding road scenes is essential for autonomous driving, as it enables systems to interpret their visual surroundings and make effective decisions. We present Roadscapes, a multitask multimodal dataset of up to 9,000 images captured in diverse Indian driving environments, accompanied by manually verified bounding boxes. To enable scalable scene understanding, we employ rule-based heuristics to infer various scene attributes, which are then used to generate question-answer (QA) pairs for tasks such as object grounding, reasoning, and scene understanding. The dataset spans urban and rural India, encompassing highways, service roads, village paths, and congested city streets, captured in both daytime and nighttime settings. Roadscapes has been curated to advance research on visual scene understanding in unstructured environments. In this paper, we describe the data collection and annotation process, present key dataset statistics, and provide initial baselines for image QA tasks using vision-language models.
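The heuristic QA-generation pipeline described above can be sketched as follows. This is a minimal illustration only: the attribute names (`congested`, `has_pedestrian`), the rule thresholds, and the question templates are assumptions for exposition, not the dataset's actual schema.

```python
# Hypothetical sketch: rule-based attribute inference followed by
# template-based QA generation. Names and thresholds are illustrative.

def infer_attributes(labels):
    """Derive coarse scene attributes from per-image object labels via simple rules."""
    return {
        # Assumed rule: five or more vehicles marks the scene as congested.
        "congested": sum(1 for l in labels if l == "vehicle") >= 5,
        "has_pedestrian": "person" in labels,
    }

# Each attribute maps to a question template and an answer formatter.
TEMPLATES = {
    "congested": ("Is the road congested?", lambda v: "yes" if v else "no"),
    "has_pedestrian": ("Are there pedestrians in the scene?", lambda v: "yes" if v else "no"),
}

def generate_qa(labels):
    """Turn an image's object labels into (question, answer) pairs."""
    attrs = infer_attributes(labels)
    return [(question, fmt(attrs[key])) for key, (question, fmt) in TEMPLATES.items()]

pairs = generate_qa(["vehicle"] * 6 + ["person"])
# pairs == [("Is the road congested?", "yes"),
#           ("Are there pedestrians in the scene?", "yes")]
```

In practice such rules would operate over the verified bounding-box annotations rather than raw label lists, but the template-driven structure is the same.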