Bagging is a commonly used ensemble technique in statistics and machine learning to improve the performance of prediction procedures. In this paper, we study the prediction risk of variants of bagged predictors under the proportional asymptotics regime, in which the ratio of the number of features to the number of observations converges to a constant. Specifically, we propose a general strategy to analyze the prediction risk under squared error loss of bagged predictors using classical results on simple random sampling. Specializing the strategy, we derive the exact asymptotic risk of the bagged ridge and ridgeless predictors with an arbitrary number of bags under a well-specified linear model with arbitrary feature covariance matrices and signal vectors. Furthermore, we prescribe a generic cross-validation procedure to select the optimal subsample size for bagging and discuss its utility in eliminating the non-monotonic behavior of the limiting risk in the sample size (i.e., double or multiple descents). In demonstrating the proposed procedure for bagged ridge and ridgeless predictors, we thoroughly investigate the oracle properties of the optimal subsample size and provide an in-depth comparison between different bagging variants.
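The ideas summarized above can be illustrated with a minimal sketch. It assumes one particular bagging variant (averaging ridge fits over subsamples drawn without replacement) and a simple hold-out split in place of the paper's cross-validation procedure, whose details the abstract does not give. The function names, the synthetic data, the regularization scaling, and the candidate subsample sizes are all hypothetical choices made for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge coefficients; lam == 0 returns the minimum-norm (ridgeless) fit."""
    n, p = X.shape
    if lam == 0:
        # The pseudoinverse gives the minimum-norm least-squares solution.
        return np.linalg.pinv(X) @ y
    # Assumed scaling: penalty lam applied to the 1/n-normalized Gram matrix.
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)


def subagged_ridge(X, y, k, M, lam, rng):
    """Average M ridge fits, each on k rows drawn without replacement."""
    n = X.shape[0]
    betas = []
    for _ in range(M):
        idx = rng.choice(n, size=k, replace=False)  # simple random sample
        betas.append(ridge_fit(X[idx], y[idx], lam))
    return np.mean(betas, axis=0)


def select_subsample_size(X, y, ks, M, lam, rng, val_frac=0.2):
    """Pick the subsample size with the lowest hold-out squared error.

    Assumption: a single train/validation split stands in for the
    cross-validation procedure mentioned in the abstract.
    """
    n = X.shape[0]
    perm = rng.permutation(n)
    n_val = int(val_frac * n)
    val, tr = perm[:n_val], perm[n_val:]
    risks = {}
    for k in ks:
        beta = subagged_ridge(X[tr], y[tr], k, M, lam, rng)
        risks[k] = float(np.mean((y[val] - X[val] @ beta) ** 2))
    best = min(risks, key=risks.get)
    return best, risks


# Synthetic well-specified linear model (illustrative values only).
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

ks = [40, 80, 120, 160]  # candidate subsample sizes, at most the training size
best_k, risks = select_subsample_size(X, y, ks, M=10, lam=0.0, rng=rng)
beta_hat = subagged_ridge(X, y, best_k, M=10, lam=0.0, rng=rng)
print(best_k, risks)
```

Setting `lam=0.0` exercises the ridgeless case, where some candidate sizes fall below the feature count `p` and others above it, so the hold-out risks give a rough view of how the error depends on the subsample size.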