Random forests utilize bootstrap sampling to create an individual training set for each component tree. This involves sampling with replacement, with the number of instances equal to the size of the original training set ($N$). Research literature indicates that drawing fewer than $N$ observations can also yield satisfactory results. The ratio of the number of observations in each bootstrap sample to the total number of training instances is called the bootstrap rate (BR). Sampling more than $N$ observations (BR $>$ 1) has been explored in the literature only to a limited extent and has generally proven ineffective. In this paper, we re-examine this approach using 36 diverse datasets and consider BR values ranging from 1.2 to 5.0. Contrary to previous findings, we show that such parameterization can result in statistically significant improvements in classification accuracy compared to standard settings (BR $\leq$ 1). Furthermore, we investigate what the optimal BR depends on and conclude that it is more a property of the dataset than a dependence on the random forest hyperparameters. Finally, we develop a binary classifier to predict whether the optimal BR is $\leq$ 1 or $>$ 1 for a given dataset, achieving between 81.88\% and 88.81\% accuracy, depending on the experiment configuration.
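The bootstrap-rate mechanism described above can be sketched in a few lines: draw, with replacement, `round(BR * N)` instances from a training set of size $N$, so BR $\leq$ 1 reproduces the usual (sub)sampling regime and BR $>$ 1 yields oversized bootstrap samples. This is a minimal illustrative sketch, not the paper's implementation; the helper name `bootstrap_sample` and its signature are assumptions for illustration.

```python
import random

def bootstrap_sample(data, br=1.0, rng=None):
    """Draw a bootstrap sample of size round(br * len(data)), with replacement.

    br is the bootstrap rate (BR): sample size / original training-set size.
    BR = 1.0 is the standard random-forest setting; BR > 1 oversamples.
    Note: `bootstrap_sample` is a hypothetical helper, not from the paper.
    """
    rng = rng or random.Random()
    n = round(br * len(data))
    return [rng.choice(data) for _ in range(n)]

# Example: with N = 100 and BR = 2.0, each tree trains on 200 instances
# drawn with replacement from the original 100.
train = list(range(100))
sample = bootstrap_sample(train, br=2.0, rng=random.Random(0))
print(len(sample))  # 200
```

In a random forest, one such sample would be drawn independently for each component tree before fitting it.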