Long-tail scenarios remain a major bottleneck for autonomous driving evaluation, even as datasets grow by orders of magnitude. Existing evaluation pipelines are rarely human-aligned, safety-aware, verifiable, and explainable at the same time: closed-loop metrics often saturate among strong planners, while unstructured human ratings can be noisy without a carefully designed protocol. We formulate planning evaluation as additional-threat detection: given a planner trajectory and an expert reference, does the planner's displacement introduce new unsafe driving behavior? We propose FluidTest, an evaluation pipeline with three components: a pairwise WebUI protocol for reliable human annotation; a taxonomy of 32 semantic threats with evidence-grounded decision graphs; and a three-agent verification system with reflection for precision and auditability. Experiments on the WOD-E2E dataset show that FluidTest produces consistent labels among trained annotators and identifies additional threats in 65% of Poutine trajectories and 51% of RAP trajectories. These results show that state-of-the-art planners can still exhibit substantial safety-relevant failures despite high Rater Feedback Scores (RFS) and low Average Displacement Error (ADE). Additional details, guidance, and code are available at https://fluidtest.web.app.
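To make the contrast between displacement metrics and threat detection concrete, the following is a minimal sketch, not the FluidTest implementation. It assumes that ADE is the mean Euclidean distance between time-aligned waypoints and that an "additional threat" is any threat label found for the planner trajectory but not for the expert reference. The function names and the threat labels (`close_following`, `lane_departure`) are hypothetical and are not drawn from the 32-threat taxonomy.

```python
import math


def average_displacement_error(planned, reference):
    """Mean Euclidean distance between time-aligned (x, y) waypoints."""
    if not planned or len(planned) != len(reference):
        raise ValueError("trajectories must be non-empty and equally long")
    total = sum(math.dist(p, r) for p, r in zip(planned, reference))
    return total / len(planned)


def additional_threats(planner_threats, expert_threats):
    """Threat labels present for the planner but absent for the expert.

    Assumption: a threat counts as 'additional' only if the expert
    reference does not already exhibit it.
    """
    return set(planner_threats) - set(expert_threats)


# Hypothetical example: a small lateral offset yields a low ADE,
# yet the planner still introduces a threat the expert avoids.
expert_traj = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
planner_traj = [(0.0, 0.3), (1.0, 0.4), (2.0, 0.5)]

ade = average_displacement_error(planner_traj, expert_traj)
new_threats = additional_threats(
    {"close_following", "lane_departure"},  # hypothetical planner labels
    {"close_following"},                    # hypothetical expert labels
)
print(f"ADE = {ade:.2f} m, additional threats = {sorted(new_threats)}")
```

Under these assumptions, a trajectory can stay within a fraction of a metre of the reference and still be flagged, which is the kind of case a displacement metric alone would not surface.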