After a machine learning model has been deployed into production, its predictive performance needs to be monitored. Ideally, such monitoring can be carried out by comparing the model's predictions against ground truth labels. For this to be possible, the ground truth labels must be available relatively soon after inference. However, there are many use cases where ground truth labels are available only after a significant delay, or in the worst case, not at all. In such cases, directly monitoring the model's predictive performance is impossible. Recently, novel methods have been developed for estimating the predictive performance of a model when ground truth is unavailable. Many of these methods leverage model confidence or other uncertainty estimates and are experimentally compared against a naive baseline method, namely Average Confidence (AC), which estimates model accuracy as the average of the confidence scores for a given set of predictions. However, until now the theoretical properties of the AC method have not been properly explored. In this paper, we fill this gap by analyzing the AC method and showing that, under certain general assumptions, it is an unbiased and consistent estimator of model accuracy with many desirable properties. We also empirically compare this baseline estimator against some more complex estimators and show that in many cases the AC method outperforms the others, although the comparative quality of the different estimators is heavily case-dependent.
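To make the AC estimator concrete, here is a minimal sketch in Python. It assumes a perfectly calibrated classifier, i.e. a prediction made with confidence p is correct with probability p; under this (idealized) assumption the average of the confidence scores is an unbiased estimate of accuracy. All names and the simulated data are illustrative, not from the paper.

```python
import random

random.seed(0)

def average_confidence(confidences):
    """AC estimator: estimate model accuracy as the mean confidence score."""
    return sum(confidences) / len(confidences)

# Simulate a perfectly calibrated classifier: a prediction with
# confidence p is correct with probability p (the idealized assumption
# under which AC is an unbiased estimator of accuracy).
n = 100_000
confidences = [random.uniform(0.5, 1.0) for _ in range(n)]
correct = [1 if random.random() < p else 0 for p in confidences]

true_accuracy = sum(correct) / n
ac_estimate = average_confidence(confidences)

print(f"true accuracy: {true_accuracy:.4f}")
print(f"AC estimate:   {ac_estimate:.4f}")
```

With a large sample the two numbers agree closely; with a miscalibrated model (e.g. systematically overconfident scores), the AC estimate would be biased by exactly the model's calibration error, which is why calibration-style assumptions are needed for the unbiasedness result.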