The random forest (RF) algorithm has become a popular prediction method because of its flexibility and promising accuracy. RF conventionally aggregates the predictions of its base learners (trees) with equal weights. However, the predictive performance of individual trees within the forest can differ substantially because of the randomization introduced by bootstrap sampling and feature selection. In this paper, we focus on RF for regression and propose two optimal weighting algorithms, namely the 1 Step Optimal Weighted RF (1step-WRF$_\mathrm{opt}$) and the 2 Steps Optimal Weighted RF (2steps-WRF$_\mathrm{opt}$), which combine the base learners using weights determined by weight choice criteria. Under some regularity conditions, we show that these algorithms are asymptotically optimal in the sense that the resulting squared loss and risk are asymptotically identical to those of the infeasible but best possible model averaging estimator. Numerical studies on real-world data sets indicate that, in most cases, these algorithms outperform the equal-weight forest and two other weighted RFs proposed in the existing literature.
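To make the idea of replacing equal weights with data-driven weights concrete, the following is a minimal sketch and not the weight choice criteria proposed in the paper. It assumes the weights are constrained to the unit simplex (nonnegative, summing to one) and are chosen by minimizing plain squared error with projected gradient descent. The matrix `preds` is a synthetic stand-in for per-tree predictions, and the helper names `project_to_simplex` and `fit_simplex_weights` are hypothetical.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                      # sort in descending order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index satisfying the condition
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def fit_simplex_weights(preds, y, n_iter=500):
    """Minimize mean((y - preds @ w)**2) over the simplex.

    preds: (n_samples, n_trees) array, one column of predictions per tree.
    Assumption: plain squared error is used here for illustration only.
    """
    n, m = preds.shape
    w = np.full(m, 1.0 / m)                   # start from the equal-weight forest
    # Step size 1/L, where L is the Lipschitz constant of the gradient.
    lipschitz = 2.0 / n * np.linalg.norm(preds, 2) ** 2
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        grad = -2.0 / n * preds.T @ (y - preds @ w)
        w = project_to_simplex(w - step * grad)
    return w


# Synthetic stand-in for tree predictions: four "trees" of differing quality.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
noise_sd = np.array([0.2, 0.5, 1.0, 2.0])
preds = y[:, None] + rng.normal(size=(200, 4)) * noise_sd

w_opt = fit_simplex_weights(preds, y)
w_eq = np.full(4, 0.25)
mse_opt = np.mean((y - preds @ w_opt) ** 2)
mse_eq = np.mean((y - preds @ w_eq) ** 2)
```

Because the iteration starts at equal weights and uses a step size of 1/L, the fitted weights never do worse in-sample than the equal-weight average. A criterion that fits weights on the same data used to grow the trees would favor overfitted trees, which is one reason a dedicated weight choice criterion is needed in practice.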