Random forests (RFs) use bootstrap sampling to generate a separate training set for each component tree by sampling with replacement, with the sample size typically equal to the size of the original training set ($N$). Previous research indicates that drawing fewer than $N$ observations can also yield satisfactory results. The ratio of the number of observations in each bootstrap sample to the total number of training instances is referred to as the bootstrap rate (BR). Sampling more than $N$ observations (BR $>$ 1.0) has been explored only to a limited extent and has generally been considered ineffective. In this paper, we revisit this setup using 36 diverse datasets, evaluating BR values ranging from 1.2 to 5.0. Contrary to previous findings, we show that higher BR values can lead to statistically significant improvements in classification accuracy compared to standard settings (BR $\leq$ 1.0). Furthermore, we analyze how BR affects the leaf structure of decision trees within the RF and investigate factors influencing the optimal BR. Our results indicate that the optimal BR is primarily determined by the characteristics of the dataset rather than by the RF hyperparameters.
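To make the definition of the bootstrap rate concrete, the following is a minimal sketch of drawing a single bootstrap sample at a given BR. It is not the authors' implementation: the function names `bootstrap_indices` and `expected_unique_fraction` are hypothetical, and rounding BR $\cdot N$ to the nearest integer is an assumption. The second function uses the standard result that, after $m$ draws with replacement from $N$ observations, a given observation appears at least once with probability $1 - (1 - 1/N)^m$, which is roughly $1 - e^{-\mathrm{BR}}$ for large $N$.

```python
import random


def bootstrap_indices(n_samples, bootstrap_rate, rng):
    """Draw round(bootstrap_rate * n_samples) row indices with replacement.

    Hypothetical helper: the rounding rule is an assumption, not taken
    from the paper.
    """
    size = int(round(bootstrap_rate * n_samples))
    return rng.choices(range(n_samples), k=size)


def expected_unique_fraction(n_samples, bootstrap_rate):
    """Expected share of distinct original observations in one sample."""
    size = int(round(bootstrap_rate * n_samples))
    return 1.0 - (1.0 - 1.0 / n_samples) ** size


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs match
    n = 1000
    # BR <= 1.0 is the standard range; 1.2 and 5.0 are the bounds studied.
    for br in (0.5, 1.0, 1.2, 5.0):
        idx = bootstrap_indices(n, br, rng)
        observed = len(set(idx)) / n
        expected = expected_unique_fraction(n, br)
        print(f"BR={br}: size={len(idx)}, "
              f"unique={observed:.3f}, expected={expected:.3f}")
```

Under these assumptions, raising BR above 1.0 increases both the share of distinct training observations each tree sees and the number of repeated copies per observation, which is one plausible route by which BR could influence the leaf structure of the trees.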