Optimal Hold-Out Size in Cross-Validation

Cross-validation (CV) is routinely used across the sciences to select models and tune parameters, and the resulting choices are often interpreted as substantive scientific conclusions (e.g., which variables, mechanisms, or risk factors are "supported by the data"). A key part of the CV procedure, the hold-out size (or equivalently the fold count $K$), is typically set by convention (e.g., 80/20, $K=5$) rather than by a principled criterion. Central to the issue is the tradeoff between training and testing: enlarging the training sample improves model accuracy, but it shrinks the test sample and so reduces certainty about the estimated accuracy itself. We formalize the tradeoff by targeting predictive performance and explicitly penalizing evaluation uncertainty, a quantity that cannot be identified from the data without additional assumptions. We derive finite-sample expressions for this evaluation uncertainty under symmetric errors and general upper bounds under broader error conditions, yielding a transparent utility-based rule for selecting the hold-out size as a function of an irreducible-noise parameter. Empirical analyses with linear regression and random forests across multiple domains, and a high-dimensional genomics application, show that (i) the choice of $K$ depends on the data and the model, (ii) the optimal $K$ varies with the assumption made about the irreducible error, and (iii) the implied inferential conclusions can change materially as the irreducible error, and thus $K$, varies. The resulting framework replaces a one-size-fits-all convention with a context-specific, assumption-explicit choice of $K$, enabling more reliable model comparisons and downstream scientific inference.
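To make the stated equivalence between hold-out size and fold count concrete, here is a minimal sketch. It is not taken from the paper: `holdout_sizes` is a hypothetical helper, and it assumes the common convention of splitting $n$ samples into $K$ near-equal folds, so that each hold-out set holds roughly $n/K$ samples and each training set roughly $n(K-1)/K$.

```python
def holdout_sizes(n, k):
    """Return a (train_size, test_size) pair for each of k near-equal folds.

    Hypothetical helper for illustration only; assumes 2 <= k <= n.
    """
    base, extra = divmod(n, k)
    sizes = []
    for fold in range(k):
        # The first `extra` folds absorb the remainder, one sample each.
        test = base + (1 if fold < extra else 0)
        sizes.append((n - test, test))
    return sizes


if __name__ == "__main__":
    n = 100
    for k in (2, 5, 10, 20):
        train, test = holdout_sizes(n, k)[0]
        # A larger K means more training data but a smaller hold-out set.
        print(f"K={k:2d}  train={train:3d}  test={test:3d}  hold-out fraction={test / n:.2f}")
```

With $n = 100$ and $K = 5$ the first fold yields the familiar 80/20 split. Raising $K$ moves samples from the hold-out set into the training set, which is the tradeoff the abstract describes.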

Cross-validation

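Cross-validation, sometimes called rotation estimation or out-of-sample testing, is any of several similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a data set of known data on which it is trained (the training data set) and a data set of unknown, or first-seen, data against which it is tested (the validation data set or test set). The goal of cross-validation is to test the model's ability to predict new data that were not used in estimating it, in order to flag problems such as overfitting or selection bias and to give insight into how the model will generalize to an independent data set (for example, an unknown data set from a real problem). One round of cross-validation partitions a sample of data into complementary subsets, performs the analysis on one subset (the training set), and validates the analysis on the other subset (the validation set or test set). To reduce variability, most methods perform multiple rounds of cross-validation using different partitions and combine the validation results over the rounds (for example, by averaging) to estimate the model's predictive performance. In summary, cross-validation combines (averages) measures of fit in prediction to derive a more accurate estimate of the model's predictive performance.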
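The procedure described above (partition the sample, fit on one part, validate on the other, repeat over different partitions, and average) can be sketched as follows. This is an illustrative sketch only, not the method of any particular paper or library: the function names `fit_line` and `k_fold_mse`, the one-variable least-squares model, and the synthetic data are all assumptions chosen to keep the example self-contained.

```python
import random


def fit_line(xs, ys):
    """Closed-form least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def k_fold_mse(xs, ys, k, seed=0):
    """Return (per-fold test MSEs, their average) for k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    # k disjoint, near-equal folds that together cover every sample once.
    folds = [idx[i::k] for i in range(k)]
    fold_mse = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        # Fit on the training subset only.
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        # Validate on the complementary held-out subset.
        errs = [(ys[i] - (a + b * xs[i])) ** 2 for i in test_idx]
        fold_mse.append(sum(errs) / len(errs))
    return fold_mse, sum(fold_mse) / k


if __name__ == "__main__":
    # Synthetic data (an assumption for illustration): y = 2 + 0.5 x + noise.
    rng = random.Random(1)
    xs = [rng.uniform(0, 10) for _ in range(60)]
    ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]
    for k in (2, 5, 10):
        fold_mse, cv_mse = k_fold_mse(xs, ys, k)
        spread = max(fold_mse) - min(fold_mse)
        print(f"K={k:2d}  CV MSE={cv_mse:.3f}  fold-to-fold range={spread:.3f}")
```

The averaged value is the cross-validation estimate of predictive error. The range across folds gives only a rough indication of how much a single hold-out evaluation can vary.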
