Autonomous vehicles are continually increasing their presence on public roads. However, before any new autonomous driving software can be approved, it must first undergo a rigorous assessment of driving quality. These quality evaluations typically focus on estimating the frequency of (undesirable) behavioral events. While rate estimation would be straightforward with complete data, in the autonomous driving setting it is greatly complicated by the fact that \textit{detecting} these events within large driving logs is a non-trivial task that often involves human reviewers. In this paper, we outline a \textit{streaming partial tiered event review} configuration that ensures both high recall and high precision on the events of interest. In addition, the framework allows for valid streaming estimates at any phase of the data collection process, even when labels are incomplete. We develop the maximum likelihood estimate for this setting and show that it is unbiased. Constructing honest and effective confidence intervals (CIs) for these rate estimates, particularly for rare safety-critical events, is a novel and challenging statistical problem due to the complexity of the data likelihood. We develop and compare several CI approximations, including a novel Gamma CI method that approximates the exact but intractable distribution with a weighted sum of independent Poisson random variables. There is a clear trade-off between statistical coverage and interval width across the different CI methods, and the extent of this trade-off varies depending on the specific application setting (e.g., rare vs. common events). In particular, we argue that our proposed CI method is best suited to estimating the rate of safety-critical events, where guaranteed coverage of the true parameter value is a prerequisite to safely launching a new automated driving system (ADS) on public roads.
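To make the idea of a Gamma-based interval for a weighted sum of independent Poisson counts more concrete, the following is a minimal sketch in Python. It is not the CI method proposed in this paper, whose details the abstract does not give. Instead it uses the classical Fay-Feuer style gamma approximation, which matches the mean and variance of the weighted sum and inflates the upper limit by the largest weight. The function names, the counts, and the weights (taken here as the inverse of a hypothetical review fraction divided by a hypothetical exposure) are all assumptions made for illustration.

```python
import math


def gamma_cdf(x, shape, scale=1.0):
    """Gamma CDF via the power series of the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape          # n = 0 term of the series
    total = term
    n = 0
    while term > 1e-15 * total and n < 10000:
        n += 1
        term *= z / (shape + n)
        total += term
    log_prefactor = -z + shape * math.log(z) - math.lgamma(shape)
    return min(1.0, total * math.exp(log_prefactor))


def gamma_ppf(p, shape, scale=1.0):
    """Gamma quantile found by bracketing and bisection on gamma_cdf."""
    lo, hi = 0.0, scale * max(shape, 1.0)
    while gamma_cdf(hi, shape, scale) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape, scale) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def weighted_poisson_gamma_ci(counts, weights, alpha=0.05):
    """Fay-Feuer style gamma interval for sum_i weights[i] * counts[i].

    Assumes the counts are independent Poisson draws (an illustrative
    assumption, not a statement about the paper's likelihood).
    """
    y = sum(w * x for w, x in zip(weights, counts))        # point estimate
    v = sum(w * w * x for w, x in zip(weights, counts))    # variance estimate
    w_max = max(weights)
    # Lower limit: gamma with mean y and variance v (zero if nothing observed).
    lower = 0.0 if y == 0 else gamma_ppf(alpha / 2, shape=y * y / v, scale=v / y)
    # Upper limit: same moment matching after adding one event at the largest weight.
    y_up, v_up = y + w_max, v + w_max ** 2
    upper = gamma_ppf(1 - alpha / 2, shape=y_up ** 2 / v_up, scale=v_up / y_up)
    return y, lower, upper


# Hypothetical example: two review tiers over 1000 miles of driving, with
# 100% and 10% of candidate events reviewed, and 3 and 1 confirmed events.
rate, low, high = weighted_poisson_gamma_ci([3, 1], [1 / 1000, 10 / 1000])
print(f"rate per mile: {rate:.4f}, 95% CI: ({low:.4f}, {high:.4f})")
```

With a single stratum and unit weight, this construction reduces to the exact (Garwood) Poisson interval, which is one reason gamma-type intervals tend to be conservative for rare events. The sparsely reviewed tier contributes the largest weight and therefore dominates the width of the upper limit.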