Ensemble methods such as bagging and random forests are ubiquitous in fields ranging from finance to genomics. However, the question of how to tune ensemble parameters efficiently has received relatively little attention. In this paper, we propose a cross-validation method, ECV (Extrapolated Cross-Validation), for tuning the ensemble and subsample sizes of randomized ensembles. Our method builds on two main ingredients: two initial risk estimators for small ensemble sizes based on out-of-bag errors, and a novel risk extrapolation technique that leverages the structure of the prediction risk decomposition. By establishing uniform consistency over ensemble and subsample sizes, we show that ECV yields $\delta$-optimal ensembles (with respect to the oracle-tuned risk) for squared prediction risk. Our theory accommodates general ensemble predictors, requires only mild moment assumptions, and allows for high-dimensional regimes where the feature dimension grows with the sample size. As an illustrative example, we employ ECV to predict surface protein abundances from gene expressions in single-cell multiomics using random forests. Compared to sample-split cross-validation and K-fold cross-validation, ECV achieves higher accuracy by avoiding sample splitting. Meanwhile, its computational cost is considerably lower owing to the risk extrapolation technique. Further numerical results demonstrate the finite-sample accuracy of ECV for several common ensemble predictors.
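To make the risk extrapolation idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes that the squared prediction risk of an ensemble of size M has the form R_M = a + b/M, which is one reading of "the structure of the prediction risk decomposition" for averaged ensembles. Under that assumption, risk estimates at M = 1 and M = 2 determine the whole curve. The function names, the `delta` stopping rule, and the `m_max` cap are hypothetical and chosen only for illustration; in practice the two inputs would come from out-of-bag error estimates.

```python
def extrapolate_risk(r1: float, r2: float, m: int) -> float:
    """Extrapolate the risk of an M-ensemble from estimates at M=1 and M=2.

    Assumes R_M = a + b / M (an assumption of this sketch), so that
    a = 2*r2 - r1 (the M -> infinity risk) and b = 2*(r1 - r2).
    """
    if m < 1:
        raise ValueError("ensemble size must be at least 1")
    a = 2.0 * r2 - r1        # limiting risk as M grows
    b = 2.0 * (r1 - r2)      # part of the risk that averaging removes
    return a + b / m


def smallest_delta_optimal_size(r1: float, r2: float, delta: float,
                                m_max: int = 10_000) -> int:
    """Smallest M whose extrapolated risk is within delta of the limit.

    Hypothetical stopping rule for illustration: returns the first M with
    R_M - R_infinity <= delta, or m_max if no such M is found.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    b = 2.0 * (r1 - r2)
    for m in range(1, m_max + 1):
        # R_M - R_infinity equals b / M under the assumed form.
        if b / m <= delta:
            return m
    return m_max


if __name__ == "__main__":
    # Made-up risk estimates, standing in for out-of-bag errors.
    r1_hat, r2_hat = 1.0, 0.75
    for m in (1, 2, 4, 8):
        print(m, extrapolate_risk(r1_hat, r2_hat, m))
    print(smallest_delta_optimal_size(r1_hat, r2_hat, delta=0.06))
```

Because only ensembles of size one and two need to be evaluated, a rule of this kind avoids fitting and validating large ensembles for every candidate size, which is consistent with the lower computational cost claimed above.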