Ensemble methods such as bagging and random forests are ubiquitous in fields ranging from finance to genomics. Despite their prevalence, the question of how to tune ensemble parameters efficiently has received relatively little attention. This paper introduces a cross-validation method, ECV (Extrapolated Cross-Validation), for tuning the ensemble and subsample sizes in randomized ensembles. Our method builds on two primary ingredients: initial estimators for small ensemble sizes based on out-of-bag errors, and a novel risk extrapolation technique that leverages the structure of the prediction risk decomposition. By establishing uniform consistency of our risk extrapolation technique over ensemble and subsample sizes, we show that ECV yields $\delta$-optimal (with respect to the oracle-tuned risk) ensembles for squared prediction risk. Our theory accommodates general ensemble predictors, requires only mild moment assumptions, and allows for high-dimensional regimes where the feature dimension grows with the sample size. As a practical case study, we employ ECV to predict surface protein abundances from gene expressions in single-cell multiomics using random forests. Compared with sample-split cross-validation and $K$-fold cross-validation, ECV achieves higher accuracy by avoiding sample splitting. At the same time, its computational cost is considerably lower owing to the use of the risk extrapolation technique. Additional numerical results validate the finite-sample accuracy of ECV for several common ensemble predictors under a computational constraint on the maximum ensemble size.
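To illustrate the idea behind risk extrapolation, the following is a minimal sketch, not the paper's implementation. It assumes the standard fact that, under squared loss, the risk of an average of $M$ exchangeable randomized predictors has the form $R_M = R_\infty + (R_1 - R_\infty)/M$, so estimates of $R_1$ and $R_2$ determine $R_M$ for every $M$. The function names, the fixed subsample size, and the use of the infinite-ensemble risk as the comparison target are simplifying assumptions made here for illustration; in practice the inputs `r1` and `r2` would come from out-of-bag error estimates.

```python
def extrapolate_risk(r1, r2, m):
    """Extrapolate squared prediction risk to ensemble size m.

    Assumes R_m = R_inf + (R_1 - R_inf) / m, which gives
    R_inf = 2 * R_2 - R_1 and R_1 - R_inf = 2 * (R_1 - R_2).
    r1 and r2 are risk estimates for ensemble sizes 1 and 2.
    """
    r_inf = 2.0 * r2 - r1              # limiting risk as m grows
    return r_inf + 2.0 * (r1 - r2) / m


def smallest_delta_optimal_size(r1, r2, delta, m_max):
    """Smallest m <= m_max whose extrapolated risk is within delta of R_inf.

    Hypothetical helper: falls back to m_max if no size meets the
    tolerance under the computational budget.
    """
    gap = 2.0 * (r1 - r2)              # equals R_1 - R_inf
    for m in range(1, m_max + 1):
        if gap / m <= delta:
            return m
    return m_max


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    r1_hat, r2_hat = 1.0, 0.75
    for m in (1, 2, 4, 8, 16):
        print(m, extrapolate_risk(r1_hat, r2_hat, m))
    print(smallest_delta_optimal_size(r1_hat, r2_hat, delta=0.06, m_max=50))
```

Only ensembles of size one and two need to be fitted and evaluated; the risk at every larger size follows from the extrapolation, which is what would keep the cost of such a tuning procedure low.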