Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present \textbf{CrashSight}, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.
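As an illustration of how a two-tier multiple-choice benchmark of this kind might be scored, the following is a minimal sketch that computes accuracy per tier and per question category. The record fields (`tier`, `category`, `answer`, `prediction`) and the helper `accuracy_by` are hypothetical and are not taken from the released CrashSight code. The sample records are toy values for demonstration only, not benchmark results.

```python
from collections import defaultdict

# Toy records with an assumed layout; the released CrashSight schema may differ.
SAMPLE = [
    {"tier": 1, "category": "scene_context", "answer": "B", "prediction": "B"},
    {"tier": 1, "category": "involved_parties", "answer": "A", "prediction": "A"},
    {"tier": 2, "category": "causal_attribution", "answer": "C", "prediction": "A"},
    {"tier": 2, "category": "temporal_progression", "answer": "D", "prediction": "D"},
]


def accuracy_by(records, key):
    """Return {group: fraction of correct answers}, grouping records by `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


tier_acc = accuracy_by(SAMPLE, "tier")
category_acc = accuracy_by(SAMPLE, "category")
print(tier_acc)
print(category_acc)
```

Grouping by `tier` separates visual grounding from higher-level reasoning, while grouping by `category` exposes which reasoning types (for example, causal or temporal) account for the errors.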