DriveSpatial: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

Hao Vo, Khoa Vo, Phu Loc Nguyen, Sieu Tran, Duc Minh Nguyen, Ngo Xuan Cuong, Gladys Gawugah, Sreevenkata Anjani Tishita Godavarthi, Chase Rainwater, Nghi D. Q. Bui, Anh Nguyen, Duy Minh Ho Nguyen, Ngan Le

Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics. However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes. We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets. DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning. Evaluating 15 representative VLMs reveals a substantial human-model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction emerging as the key bottleneck. Further diagnostics show that language-only prompting is insufficient, while explicit BEV grounding consistently improves performance. These results suggest that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence. DriveSpatial and its construction pipeline will be released to support future research.
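To make the scene-graph idea concrete, the following is a minimal sketch of how a per-frame graph with camera visibility could be used to generate a question that cannot be answered from a single view. It is not the DriveSpatial pipeline: the class names (`ObjectState`, `Frame`), the function `cross_view_distance_questions`, the ego-frame coordinate convention, and the camera names are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import combinations
from math import hypot


@dataclass
class ObjectState:
    track_id: str            # kept stable across frames (temporal correspondence)
    category: str
    xy: tuple[float, float]  # assumed: position in the ego frame, in metres
    cameras: frozenset[str]  # cameras in which the object is visible


@dataclass
class Frame:
    timestamp: float
    objects: dict[str, ObjectState] = field(default_factory=dict)
    # (subject_id, predicate, object_id) triples for spatial or interaction edges
    relations: list[tuple[str, str, str]] = field(default_factory=list)


def cross_view_distance_questions(frame: Frame) -> list[dict]:
    """Ask 'which is closer to ego' only for object pairs that share no camera,
    so that no single view is sufficient to answer."""
    questions = []
    for a, b in combinations(frame.objects.values(), 2):
        # Skip objects that are not visible, and pairs visible in a common view.
        if not a.cameras or not b.cameras or a.cameras & b.cameras:
            continue
        dist_a, dist_b = hypot(*a.xy), hypot(*b.xy)
        if dist_a == dist_b:
            continue  # ambiguous answer, skip
        closer = a if dist_a < dist_b else b
        questions.append({
            "question": (
                f"Which is closer to the ego vehicle: the {a.category} in "
                f"{sorted(a.cameras)[0]} or the {b.category} in "
                f"{sorted(b.cameras)[0]}?"
            ),
            "answer": closer.track_id,
            "views": sorted(a.cameras | b.cameras),
        })
    return questions
```

Filtering on disjoint camera sets is one simple way to enforce cross-view reasoning; temporal questions could be built the same way by matching `track_id` across consecutive `Frame` objects.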
