Model selection aims to identify a sufficiently well-performing model that is possibly simpler than the most complex model in a pool of candidates. However, the decision-making process itself can inadvertently introduce non-negligible bias when the cross-validation estimates of predictive performance are excessively noisy. In finite-data regimes, cross-validated estimates can lead the statistician to select one model over another when it is not actually better for future data. While this bias remains negligible when there are few models, as the pool of candidates grows and model selection decisions are compounded (as in forward search), the expected magnitude of selection-induced bias is likely to grow too. This paper introduces an efficient approach, based on order statistics, to estimate and correct selection-induced bias. Numerical experiments demonstrate the reliability of our approach in estimating both selection-induced bias and over-fitting along compounded model selection decisions, with specific application to forward search. This work represents a lightweight alternative to more computationally expensive approaches to correcting selection-induced bias, such as nested cross-validation and the bootstrap. Our approach rests on several theoretical assumptions, and we provide a diagnostic to help understand when these may not be valid and when to fall back on safer, albeit more computationally expensive, approaches. The accompanying code facilitates practical implementation and fosters further exploration in this area.
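To make the phenomenon concrete, the following is a minimal illustrative sketch and not the estimator proposed in the paper. It assumes a toy setting in which every candidate model has the same true performance (zero) and each cross-validation estimate is that truth plus independent Gaussian noise. Under this assumption, reporting the estimate of the best-looking model is optimistic by the expected maximum of the noise terms, an order statistic. The function names `simulate_selection_bias` and `blom_expected_max`, and the use of Blom's textbook approximation for the expected maximum, are choices made here for illustration only.

```python
import numpy as np
from statistics import NormalDist


def simulate_selection_bias(n_models, noise_sd=1.0, n_reps=20000, seed=0):
    """Monte Carlo estimate of the optimism incurred by picking the
    best-looking of `n_models` candidates that are all truly equal.

    Assumption (toy setting): true performance is 0 for every model and
    each estimate is independent Gaussian noise with sd `noise_sd`.
    """
    rng = np.random.default_rng(seed)
    estimates = rng.normal(0.0, noise_sd, size=(n_reps, n_models))
    # Select the model with the highest estimate in each replication and
    # average its reported estimate; the truth is 0, so this is the bias.
    return float(estimates.max(axis=1).mean())


def blom_expected_max(n_models, noise_sd=1.0, alpha=0.375):
    """Blom's approximation to the expected maximum of `n_models`
    independent N(0, noise_sd^2) draws (an order-statistic heuristic)."""
    p = (n_models - alpha) / (n_models - 2.0 * alpha + 1.0)
    return noise_sd * NormalDist().inv_cdf(p)


if __name__ == "__main__":
    for k in (1, 2, 10, 50):
        sim = simulate_selection_bias(k)
        approx = blom_expected_max(k)
        print(f"K={k:3d}  simulated bias={sim:.3f}  Blom approx={approx:.3f}")
```

In this toy setting the bias is zero for a single candidate and grows with the size of the pool, reaching roughly 1.5 noise standard deviations for ten candidates. This mirrors the abstract's claim that selection-induced bias is negligible for few models but grows as more candidates are compared.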