Model selection aims to identify a sufficiently well-performing model that is possibly simpler than the most complex model among a pool of candidates. However, the decision-making process itself can inadvertently introduce non-negligible bias when the cross-validation estimates of predictive performance are marred by excessive noise. In finite data regimes, cross-validated estimates can encourage the statistician to select one model over another when it is not actually better for future data. While this bias remains negligible when only a few models are compared, as the pool of candidates grows and model selection decisions are compounded (as in stepwise selection), the expected magnitude of selection-induced bias is likely to grow too. This paper introduces an efficient approach, based on order statistics, to estimate and correct selection-induced bias. Numerical experiments demonstrate the reliability of our approach in estimating both selection-induced bias and over-fitting along compounded model selection decisions, with specific application to forward search. This work represents a lightweight alternative to more computationally expensive approaches to correcting selection-induced bias, such as nested cross-validation and the bootstrap. Our approach rests on several theoretical assumptions, and we provide a diagnostic to help detect when these may not hold and when to fall back on safer, albeit more computationally expensive, approaches. The accompanying code facilitates practical implementation and fosters further exploration in this area.
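The core phenomenon can be sketched with a small Monte Carlo simulation (illustrative only, not the paper's estimator, and with hypothetical parameter choices): when the cross-validation estimate for each of K equally good candidate models is corrupted by noise, selecting the apparent best amounts to taking the maximum of K noisy draws, and classical order-statistic results predict that this optimism grows with the size of the candidate pool.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 50          # number of candidate models (hypothetical)
sigma = 1.0     # sd of CV-estimate noise around the true utility (hypothetical)
n_rep = 2000    # Monte Carlo repetitions

# All candidates share the same true predictive utility (zero here),
# so any apparent advantage of the selected model is pure selection bias.
bias_samples = []
for _ in range(n_rep):
    cv_estimates = rng.normal(loc=0.0, scale=sigma, size=K)
    bias_samples.append(cv_estimates.max())  # CV utility of the selected model

empirical_bias = float(np.mean(bias_samples))

# Order-statistics benchmark: the expected maximum of K iid N(0, sigma^2)
# draws is roughly sigma * sqrt(2 * log(K)), so the bias grows with K.
approx_bias = sigma * np.sqrt(2 * np.log(K))
print(f"empirical selection bias: {empirical_bias:.2f}")
print(f"order-statistic approximation: {approx_bias:.2f}")
```

Under these assumed settings the selected model looks roughly two noise standard deviations better than it truly is, even though no candidate is actually superior; this is the selection-induced optimism that the abstract's order-statistics approach aims to estimate and correct.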