Model selection aims to identify a sufficiently well-performing model that is possibly simpler than the most complex model among a pool of candidates. However, the decision-making process itself can inadvertently introduce non-negligible bias when the cross-validation estimates of predictive performance are excessively noisy. In finite-data regimes, cross-validated estimates can lead the statistician to select one model over another even when it is not actually better for future data. While this bias remains negligible when there are few candidate models, as the pool of candidates grows and model selection decisions are compounded (as in forward search), the expected magnitude of selection-induced bias is likely to grow too. This paper introduces an efficient approach, based on order statistics, to estimate and correct selection-induced bias. Numerical experiments demonstrate the reliability of our approach in both estimating selection-induced bias and quantifying the degree of over-fitting along compounded model selection decisions, with specific application to forward search. This work represents a lightweight alternative to more computationally expensive approaches to correcting selection-induced bias, such as nested cross-validation and the bootstrap. Our approach rests on several theoretical assumptions, and we provide a diagnostic to help understand when these may not hold and when to fall back on safer, albeit more computationally expensive, approaches. The accompanying code facilitates practical implementation of the approach and fosters further exploration in this area.
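To make the mechanism concrete, the following is a minimal illustrative sketch, not the estimator proposed in the paper. It assumes K candidate models that are all equally good (true performance difference of zero), whose cross-validated estimates are perturbed by independent Gaussian noise with a known standard deviation `sigma`. Under those assumptions, the score of the best-looking model overstates its true performance by the expected maximum of K Gaussian draws, which can be approximated with an order-statistic formula (here Blom's approximation, a standard choice swapped in for illustration). All function names are hypothetical.

```python
import random
from statistics import NormalDist, mean


def expected_max_standard_normal(k: int, alpha: float = 0.375) -> float:
    """Blom's approximation to E[max of k i.i.d. N(0, 1) draws]."""
    return NormalDist().inv_cdf((k - alpha) / (k - 2 * alpha + 1))


def simulated_selection_bias(k: int, sigma: float, reps: int,
                             rng: random.Random) -> float:
    """Monte Carlo estimate of the selection-induced bias.

    Every candidate has true performance 0, so the average score of the
    selected (maximal) candidate is exactly the bias from selecting.
    """
    return mean(
        max(rng.gauss(0.0, sigma) for _ in range(k)) for _ in range(reps)
    )


def compare(ks=(1, 2, 10, 100), sigma=1.0, reps=10000, seed=0):
    """Return (k, simulated bias, order-statistic approximation) rows."""
    rng = random.Random(seed)
    rows = []
    for k in ks:
        sim = simulated_selection_bias(k, sigma, reps, rng)
        approx = sigma * expected_max_standard_normal(k)
        rows.append((k, sim, approx))
    return rows


if __name__ == "__main__":
    for k, sim, approx in compare():
        print(f"K={k:4d}  simulated bias={sim:.3f}  approximation={approx:.3f}")
```

With a single candidate there is nothing to select, so the bias is zero; as K grows, both the simulated and the approximated bias increase, which is the effect described above for large candidate pools. In practice the noise level is unknown and the candidates are neither independent nor equally good, which is why the assumptions here should be read as a simplification.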