Average calibration of the (variance-based) prediction uncertainties of machine learning regression tasks can be tested in two ways: one is to estimate the calibration error (CE) as the difference between the mean squared error (MSE) and the mean variance (MV); the alternative is to compare the mean squared z-scores (ZMS) to 1. The problem is that the two approaches may lead to different conclusions, as illustrated in this study on an ensemble of datasets from the recent machine learning uncertainty quantification (ML-UQ) literature. It is shown that the estimation of MV, MSE and their confidence intervals becomes unreliable for heavy-tailed uncertainty and error distributions, which seems to be a frequent feature of ML-UQ datasets. By contrast, the ZMS statistic is less sensitive to heavy tails and offers the most reliable approach in this context. Unfortunately, the same problem is expected to also affect conditional calibration statistics, such as the popular ENCE, and very likely also post-hoc calibration methods based on similar statistics. Several solutions to circumvent the outlined problems are proposed.
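To make the two tests and their contrasting behavior concrete, the following Python sketch (illustrative, not taken from the study) compares CE and ZMS, together with percentile-bootstrap confidence intervals, on a synthetic dataset that is calibrated by construction. The lognormal model for the uncertainties and all names (`u`, `z`, `e`, `bootstrap_ci`) are assumptions made for this example only: heavy-tailed uncertainties propagate to the errors, destabilizing the CE estimate, while the z-scores remain Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset, calibrated by construction: heavy-tailed predicted
# uncertainties u (lognormal) and Gaussian z-scores, so that the errors
# e = u * z inherit the heavy tails of u (a frequent feature of ML-UQ data).
n = 5000
u = rng.lognormal(mean=0.0, sigma=1.5, size=n)  # predicted uncertainties (std. dev.)
z = rng.standard_normal(n)                      # unit-variance z-scores
e = u * z                                       # errors consistent with u

def ce(e, u):
    """Calibration error CE = MSE - MV (target: 0)."""
    return np.mean(e**2) - np.mean(u**2)

def zms(e, u):
    """Mean squared z-scores (target: 1)."""
    return np.mean((e / u) ** 2)

def bootstrap_ci(stat, e, u, n_boot=1000, alpha=0.05):
    """Percentile-bootstrap confidence interval for a calibration statistic."""
    vals = np.empty(n_boot)
    for b in range(n_boot):
        i = rng.integers(0, len(e), size=len(e))
        vals[b] = stat(e[i], u[i])
    return np.quantile(vals, [alpha / 2, 1 - alpha / 2])

print(f"CE  = {ce(e, u):8.3f}, 95% CI = {np.round(bootstrap_ci(ce, e, u), 3)}")
print(f"ZMS = {zms(e, u):8.3f}, 95% CI = {np.round(bootstrap_ci(zms, e, u), 3)}")
```

On such data the bootstrap interval for CE is typically wide and varies strongly with the random seed, even though the data are calibrated, whereas the interval for ZMS stays narrow around 1, in line with the reliability contrast described above.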