In this paper, we introduce ScenePilot-Bench, a large-scale first-person driving benchmark designed to evaluate vision-language models (VLMs) in autonomous driving scenarios. ScenePilot-Bench is built upon ScenePilot-4K, a diverse dataset comprising 3,847 hours of driving videos annotated with multi-granularity information, including scene descriptions, risk assessments, key-participant identification, ego trajectories, and camera parameters. The benchmark features a four-axis evaluation suite that assesses VLM capabilities in scene understanding, spatial perception, motion planning, and GPT-Score-based open-ended evaluation, complemented by safety-aware metrics and cross-region generalization settings. We benchmark representative VLMs on ScenePilot-Bench, providing empirical analyses that clarify current performance boundaries and identify gaps in driving-oriented reasoning. ScenePilot-Bench offers a comprehensive framework for evaluating and advancing VLMs in safety-critical autonomous driving contexts.