Extrapolated cross-validation for randomized ensembles

Ensemble methods such as bagging and random forests are ubiquitous in fields ranging from finance to genomics. Despite their prevalence, the question of how to tune ensemble parameters efficiently has received relatively little attention. This paper introduces a cross-validation method, ECV (Extrapolated Cross-Validation), for tuning the ensemble and subsample sizes in randomized ensembles. Our method builds on two primary ingredients: initial risk estimators for small ensemble sizes based on out-of-bag errors, and a novel risk extrapolation technique that leverages the structure of the prediction risk decomposition. By establishing uniform consistency of our risk extrapolation technique over ensemble and subsample sizes, we show that ECV yields $\delta$-optimal (with respect to the oracle-tuned risk) ensembles for squared prediction risk. Our theory accommodates general ensemble predictors, requires only mild moment assumptions, and allows for high-dimensional regimes where the feature dimension grows with the sample size. As a practical case study, we employ ECV to predict surface protein abundances from gene expressions in single-cell multiomics using random forests. Compared with sample-split cross-validation and $K$-fold cross-validation, ECV achieves higher accuracy because it avoids sample splitting. At the same time, its computational cost is considerably lower owing to the risk extrapolation technique. Additional numerical results validate the finite-sample accuracy of ECV for several common ensemble predictors under a computational constraint on the maximum ensemble size.
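
To illustrate the extrapolation idea, here is a minimal sketch in Python. It rests on an assumption that the abstract does not spell out: that the squared prediction risk of an ensemble of size $M$ is affine in $1/M$, that is, $R_M = R_\infty + (R_1 - R_\infty)/M$, so that estimates of $R_1$ and $R_2$ determine the whole curve. The function names `extrapolate_risk` and `smallest_delta_optimal_size` are hypothetical, the selection rule is a simplification that tunes only the ensemble size, and the risk values passed in stand in for out-of-bag estimates that are not computed here.

```python
def extrapolate_risk(r1: float, r2: float, m: int) -> float:
    """Extrapolate the squared risk to ensemble size m.

    Assumes R_m = R_inf + (R_1 - R_inf) / m, so that
    R_inf = 2 * R_2 - R_1 follows from the two smallest sizes.
    """
    if m < 1:
        raise ValueError("ensemble size must be at least 1")
    r_inf = 2.0 * r2 - r1               # assumed limiting risk as m grows
    return r_inf + 2.0 * (r1 - r2) / m  # note R_1 - R_inf = 2 * (R_1 - R_2)


def smallest_delta_optimal_size(r1: float, r2: float, delta: float, m_max: int) -> int:
    """Smallest m <= m_max whose extrapolated risk is within delta of the limit.

    Falls back to m_max when no smaller size reaches the tolerance.
    """
    for m in range(1, m_max + 1):
        excess = 2.0 * (r1 - r2) / m    # extrapolated R_m minus R_inf
        if excess <= delta:
            return m
    return m_max


if __name__ == "__main__":
    # Placeholder numbers standing in for estimated R_1 and R_2.
    r1_hat, r2_hat = 1.0, 0.75
    for m in (1, 2, 10, 100):
        print(m, extrapolate_risk(r1_hat, r2_hat, m))
    print(smallest_delta_optimal_size(r1_hat, r2_hat, delta=0.06, m_max=500))
```

Under this assumption only ensembles of size one and two ever need to be evaluated, which is one way to read the abstract's claim of lower computational cost; the risk for every larger $M$ is read off the extrapolated curve.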
