Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge because driving quality is highly context-dependent. Commonly used rule-based driving metrics such as EPDMS are interpretable but lack context-awareness, while recent evaluations based on Vision-Language Models (VLMs) are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce DriveJudge, a driving evaluation agent that combines rule-grounded evaluation with VLM reasoning: it first interprets the environmental context and then selectively invokes physically grounded, deterministic rule functions. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples, each with human annotations indicating whether the driving behavior is reasonable in the given scenario. With this dataset, we address the underexplored problem of evaluating driving metrics themselves, and introduce two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection. DriveJudge outperforms EPDMS on driving quality classification by 21.23 AUC and the recent VLM-based DriveCritic on trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.
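To make the idea of context-gated rule invocation concrete, the following is a minimal sketch and not the DriveJudge implementation. Every name in it (`Sample`, `select_rules`, `judge`, the three rule functions, the context labels, and the 2.0 m gap threshold) is a hypothetical stand-in chosen for illustration. In particular, `select_rules` is a hard-coded placeholder for the step where a VLM would interpret the scene and decide which deterministic rules apply.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    """Hypothetical driving sample; all fields are illustrative assumptions."""
    speed_mps: float
    speed_limit_mps: float
    min_gap_m: float
    crossed_lane_line: bool
    context: str  # stand-in for what a VLM would infer from the scene


# Deterministic rule functions: each returns True when the behavior passes.
def within_speed_limit(s: Sample) -> bool:
    return s.speed_mps <= s.speed_limit_mps


def keeps_safe_gap(s: Sample) -> bool:
    return s.min_gap_m >= 2.0  # assumed threshold, not taken from the paper


def stays_in_lane(s: Sample) -> bool:
    return not s.crossed_lane_line


RULES: Dict[str, Callable[[Sample], bool]] = {
    "within_speed_limit": within_speed_limit,
    "keeps_safe_gap": keeps_safe_gap,
    "stays_in_lane": stays_in_lane,
}


def select_rules(context: str) -> List[str]:
    """Placeholder for VLM reasoning: choose the rules that fit the context."""
    selected = ["within_speed_limit", "keeps_safe_gap"]
    # Crossing a lane line can be reasonable when the ego lane is blocked,
    # so the lane-keeping rule is skipped in that context.
    if context != "blocked_lane":
        selected.append("stays_in_lane")
    return selected


def judge(sample: Sample) -> dict:
    """Run only the selected rules and report a per-rule, readable verdict."""
    results = {name: RULES[name](sample) for name in select_rules(sample.context)}
    return {"reasonable": all(results.values()), "rules": results}
```

In this sketch the same trajectory receives different verdicts depending on the context label: crossing a lane line counts against the driver on a clear road but is not checked when the lane is blocked. The per-rule dictionary keeps the verdict traceable to named checks, which is the sense of interpretability the abstract refers to.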