Autonomous driving is a highly challenging domain that requires reliable perception and safe decision-making in complex scenarios. Recent vision-language models (VLMs) demonstrate strong reasoning and generalization abilities, opening new possibilities for autonomous driving; however, existing benchmarks and metrics overemphasize perceptual competence and fail to adequately assess decision-making. In this work, we present AutoDriDM, a decision-centric, progressive benchmark with 6,650 questions across three dimensions: Object, Scene, and Decision. We evaluate mainstream VLMs to delineate the perception-to-decision capability boundary in autonomous driving; our correlation analysis reveals weak alignment between perception and decision-making performance. We further conduct explainability analyses of models' reasoning processes, identifying key failure modes such as logical reasoning errors, and introduce an analyzer model to automate large-scale annotation. AutoDriDM bridges the gap between perception-centered and decision-centered evaluation, offering guidance toward safer and more reliable VLMs for real-world autonomous driving.