In machine learning, selecting a promising model from a potentially large number of competing models and assessing its generalization performance are critical tasks that need careful consideration. Typically, model selection and evaluation are strictly separated: the sample at hand is split into a training, a validation, and an evaluation set, and only a single confidence interval is computed for the prediction performance of the final selected model. We instead propose an algorithm that computes valid lower confidence bounds for multiple models that have been selected based on their prediction performance in the evaluation set, by interpreting the selection problem as a simultaneous inference problem. We use bootstrap tilting and a maxT-type multiplicity correction. The approach is universally applicable to any combination of prediction models, any model selection strategy, and any prediction performance measure that accepts weights. We conducted various simulation experiments, which show that our proposed approach yields lower confidence bounds that are at least comparable to bounds from standard approaches and that reliably reach the nominal coverage probability. In addition, especially when the sample size is small, our proposed approach yields better-performing prediction models than the default of selecting only one model for evaluation.
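To make the idea of a maxT-type correction concrete, the following is a minimal illustrative sketch, not the algorithm proposed in the paper. It assumes accuracy as the performance measure and swaps in an ordinary multinomial (weighted) bootstrap in place of bootstrap tilting. The function name `maxt_lower_bounds`, its parameters, and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np

def maxt_lower_bounds(correct, alpha=0.05, n_boot=2000, seed=0):
    """Simultaneous lower confidence bounds for the accuracies of m models.

    ASSUMPTION: plain multinomial bootstrap, NOT bootstrap tilting.
    correct: (n, m) array, entry 1 if model j predicts observation i correctly.
    """
    correct = np.asarray(correct, dtype=float)
    n, m = correct.shape
    rng = np.random.default_rng(seed)

    # Point estimates: accuracy of each model on the evaluation set.
    theta_hat = correct.mean(axis=0)

    # Bootstrap resampling counts; dividing by n turns them into weights,
    # so each replicate is a weighted accuracy (weights sum to 1).
    counts = rng.multinomial(n, np.full(n, 1.0 / n), size=n_boot)
    theta_star = (counts @ correct) / n  # shape (n_boot, m)

    # Bootstrap standard errors; guard against zero spread when dividing.
    se = theta_star.std(axis=0, ddof=1)
    safe_se = np.where(se > 0, se, 1.0)

    # maxT step: take the maximum studentized deviation over all models
    # in each replicate, then use its (1 - alpha) quantile as the
    # common critical value.
    t_star = (theta_star - theta_hat) / safe_se
    max_t = t_star.max(axis=1)
    crit = np.quantile(max_t, 1.0 - alpha)

    lower = theta_hat - crit * se
    return theta_hat, lower, crit

# Simulated evaluation set: 200 observations, 3 preselected models
# (hypothetical true accuracies, independent errors for simplicity).
rng = np.random.default_rng(1)
true_acc = np.array([0.80, 0.78, 0.75])
correct = rng.random((200, 3)) < true_acc

theta_hat, lower, crit = maxt_lower_bounds(correct)
for j in range(3):
    print(f"model {j}: accuracy {theta_hat[j]:.3f}, lower bound {lower[j]:.3f}")
```

Because the critical value is taken from the distribution of the maximum over all selected models, it is never smaller than the critical value obtained for a single model from the same bootstrap replicates. That is the price paid for bounds that hold simultaneously for every selected model.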