Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present \textbf{CrashSight}, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.
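As a minimal sketch of how multiple-choice results might be scored under the two-tier taxonomy, the snippet below computes per-tier accuracy. The category names, the `TIER_OF` mapping, the item fields (`id`, `category`, `answer`), and the function `accuracy_by_tier` are all illustrative assumptions paraphrased from the abstract; they are not the benchmark's actual schema or evaluation code.

```python
from collections import defaultdict

# Hypothetical category-to-tier mapping, paraphrased from the abstract.
# The real benchmark may use different category names and groupings.
TIER_OF = {
    "scene_context": 1,
    "involved_parties": 1,
    "crash_mechanics": 2,
    "causal_attribution": 2,
    "temporal_progression": 2,
    "post_crash_outcome": 2,
}


def accuracy_by_tier(items, predictions):
    """Return {tier: accuracy} for multiple-choice items.

    items: list of dicts with assumed keys 'id', 'category', 'answer'.
    predictions: dict mapping item id to the option a model selected.
    An item with no prediction counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        tier = TIER_OF[item["category"]]
        total[tier] += 1
        if predictions.get(item["id"]) == item["answer"]:
            correct[tier] += 1
    return {tier: correct[tier] / total[tier] for tier in sorted(total)}


if __name__ == "__main__":
    # Toy items for illustration only; these are not benchmark data.
    items = [
        {"id": "q1", "category": "scene_context", "answer": "A"},
        {"id": "q2", "category": "involved_parties", "answer": "C"},
        {"id": "q3", "category": "causal_attribution", "answer": "B"},
        {"id": "q4", "category": "temporal_progression", "answer": "D"},
    ]
    predictions = {"q1": "A", "q2": "B", "q3": "B", "q4": "D"}
    print(accuracy_by_tier(items, predictions))
```

Reporting accuracy separately per tier keeps grounding performance (Tier 1) from masking weaker temporal and causal reasoning (Tier 2).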