Average calibration of the prediction uncertainties in machine learning regression tasks can be tested in two ways. One way is to estimate the calibration error (CE) as the difference between the mean squared error (MSE) and the mean variance (MV), or mean squared uncertainty. The alternative is to compare the mean squared z-scores, or scaled errors (ZMS), to 1. The two approaches can lead to different conclusions, as illustrated on an ensemble of datasets from the recent machine learning uncertainty quantification literature. It is shown here that the CE is very sensitive to the distribution of uncertainties, notably to the presence of outlying uncertainties, and that it cannot be used reliably for calibration testing. By contrast, the ZMS statistic does not present this sensitivity issue and offers the most reliable approach in this context. Implications for the validation of conditional calibration are discussed.
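As a minimal sketch of the two statistics, the snippet below computes CE as MSE minus MV and ZMS as the mean of the squared scaled errors. The function names (`calibration_error`, `zms`), the sign convention for CE, and the synthetic data are assumptions made for illustration; they are not taken from the study. The demo reuses the same standard normal z-scores with two hypothetical uncertainty distributions, one of which has a heavier tail. Because ZMS depends on the data only through the z-scores, it is unchanged between the two cases, whereas CE weights each term by the squared uncertainty and so varies with the uncertainty distribution.

```python
import numpy as np


def calibration_error(errors, uncertainties):
    """CE = MSE - MV (sign convention assumed here for illustration)."""
    errors = np.asarray(errors, dtype=float)
    uncertainties = np.asarray(uncertainties, dtype=float)
    mse = np.mean(errors**2)        # mean squared error
    mv = np.mean(uncertainties**2)  # mean variance (mean squared uncertainty)
    return mse - mv


def zms(errors, uncertainties):
    """Mean squared z-score; to be compared to 1."""
    errors = np.asarray(errors, dtype=float)
    uncertainties = np.asarray(uncertainties, dtype=float)
    z = errors / uncertainties      # scaled errors
    return np.mean(z**2)


# Synthetic, calibrated-by-construction data (hypothetical, not from the study).
rng = np.random.default_rng(0)
n = 5000
z = rng.standard_normal(n)                            # shared z-scores
u_narrow = rng.lognormal(mean=0.0, sigma=0.2, size=n)  # few outlying uncertainties
u_heavy = rng.lognormal(mean=0.0, sigma=1.0, size=n)   # heavier upper tail

for label, u in [("narrow", u_narrow), ("heavy-tailed", u_heavy)]:
    e = u * z  # errors generated with standard deviation u
    print(f"{label:>12}: CE = {calibration_error(e, u):+.3f}, ZMS = {zms(e, u):.3f}")
```

In this construction the printed ZMS is the same for both rows, while the CE value differs between them even though both datasets are calibrated by design.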