Average calibration of the prediction uncertainties in machine learning regression tasks can be tested in two ways. One way is to estimate the calibration error (CE) as the difference between the mean squared error (MSE) and the mean variance (MV), i.e. the mean squared uncertainty. The alternative is to compare the mean squared z-scores, or scaled errors (ZMS), to 1. The two approaches can lead to different conclusions, as illustrated on an ensemble of datasets from the recent machine learning uncertainty quantification literature. It is shown here that the CE is very sensitive to the distribution of uncertainties, notably to the presence of outlying uncertainties, and that it cannot be used reliably for calibration testing. By contrast, the ZMS statistic does not suffer from this sensitivity and offers the most reliable approach in this context. Implications for the validation of conditional calibration are discussed.
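The two statistics can be illustrated with a minimal sketch. It assumes paired lists of prediction errors and (strictly positive) predictive uncertainties, takes the CE as the plain difference MSE - MV, and uses a hypothetical helper name `calibration_stats`. The synthetic data, with lognormally distributed uncertainties, is an illustrative choice and is not one of the datasets of the study.

```python
import random
from statistics import fmean


def calibration_stats(errors, uncertainties):
    """Return (CE, ZMS) for paired errors and uncertainties.

    CE  = MSE - MV, to be compared to 0.
    ZMS = mean of squared z-scores (error / uncertainty), to be compared to 1.
    The name and signature are hypothetical, for illustration only.
    """
    mse = fmean(e * e for e in errors)         # mean squared error
    mv = fmean(u * u for u in uncertainties)   # mean variance
    zms = fmean((e / u) ** 2 for e, u in zip(errors, uncertainties))
    return mse - mv, zms


# Synthetic, calibrated-by-construction data (assumption): each error is drawn
# from a zero-mean normal distribution whose standard deviation is its uncertainty.
rng = random.Random(0)
n = 10_000
u = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]  # heavy-tailed uncertainties
e = [rng.gauss(0.0, ui) for ui in u]

ce, zms = calibration_stats(e, u)
print(f"CE  = {ce:.3f}  (target 0)")
print(f"ZMS = {zms:.3f}  (target 1)")
```

Rerunning the sketch with different seeds or a heavier-tailed uncertainty distribution gives a rough feel for how much each statistic fluctuates around its target on data that is calibrated by construction.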