Any supervised machine learning analysis should report an estimate of its out-of-sample predictive performance. However, it is imperative to also quantify the uncertainty of this estimate in the form of a confidence or credible interval (CI), and not report a point estimate alone. In an AutoML setting, estimating the CI is challenging due to the ``winner's curse'', i.e., the estimation bias incurred by cross-validating several machine learning pipelines and selecting the winning one. In this work, we perform a comparative evaluation of nine state-of-the-art methods and variants for CI estimation in an AutoML setting on a corpus of real and simulated datasets. The methods are compared in terms of inclusion percentage (whether a 95\% CI includes the true performance at least 95\% of the time), CI tightness (tighter CIs are preferable, as they are more informative), and execution time. This evaluation is the first to cover most, if not all, such methods, and it extends previous work to imbalanced and small-sample tasks. In addition, we present a variant, called BBC-F, of an existing method (the Bootstrap Bias Correction, or BBC) that maintains the statistical properties of the BBC but is more computationally efficient. The results show that BBC-F and BBC dominate the other methods on all metrics measured.
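To make the BBC idea concrete, the following is a minimal Python sketch, not the exact implementation evaluated in this work. It assumes a hypothetical $N \times C$ Boolean matrix \texttt{oos\_correct}, where entry $(i, j)$ indicates whether configuration $j$ predicted sample $i$ correctly in its pooled out-of-sample (cross-validated) predictions. Each bootstrap iteration selects the winning configuration on the in-bag samples and scores it on the out-of-bag samples; the percentiles of the resulting distribution yield the CI.

\begin{verbatim}
# Sketch of the Bootstrap Bias Correction (BBC) idea; `oos_correct`
# is a hypothetical N x C Boolean matrix of out-of-sample hits.
import numpy as np

def bbc_ci(oos_correct, n_boot=1000, alpha=0.05, seed=0):
    """Bias-corrected performance estimate and (1 - alpha) CI for
    the winning configuration, via bootstrapping the pooled
    out-of-sample predictions."""
    rng = np.random.default_rng(seed)
    n = oos_correct.shape[0]
    stats = []
    for _ in range(n_boot):
        in_bag = rng.integers(0, n, size=n)           # bootstrap rows
        out_bag = np.setdiff1d(np.arange(n), in_bag)  # rows left out
        if out_bag.size == 0:
            continue
        # "Winner's curse" step: pick the config best in-bag ...
        best = oos_correct[in_bag].mean(axis=0).argmax()
        # ... but score it out-of-bag to correct the bias.
        stats.append(oos_correct[out_bag, best].mean())
    stats = np.asarray(stats)
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return stats.mean(), (lo, hi)

# Toy usage: 500 samples, 20 configurations of varying accuracy.
rng = np.random.default_rng(1)
demo = rng.random((500, 20)) < rng.uniform(0.6, 0.8, size=20)
est, (lo, hi) = bbc_ci(demo)
print(f"corrected accuracy = {est:.3f}, "
      f"95% CI = ({lo:.3f}, {hi:.3f})")
\end{verbatim}

Evaluating the in-bag winner on out-of-bag samples is what removes the winner's-curse optimism: the selection step and the evaluation step never see the same resampled data.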