Random forests are widely used prediction procedures, yet they are typically described algorithmically rather than as statistical designs acting on a fixed dataset. We develop a finite-sample, design-based formulation of random forests in which each tree is an explicit randomized conditional regression function. This perspective yields an exact variance identity for the forest predictor that separates finite-aggregation variability from a structural dependence term persisting even under infinite aggregation. We further decompose both single-tree dispersion and inter-tree covariance using the laws of total variance and covariance, isolating two fundamental design mechanisms: reuse of training observations and alignment of data-adaptive partitions. These mechanisms induce a strict covariance floor, demonstrating that predictive variability cannot be eliminated by increasing the number of trees alone. The resulting framework clarifies how resampling, feature-level randomization, and split selection govern resolution, tree variability, and dependence, and establishes random forests as explicit finite-sample statistical designs whose behavior is determined by their underlying randomized construction.
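To fix ideas, here is a minimal sketch of the kind of variance identity referred to above, under the standard assumption of $M$ exchangeable randomized trees $T_1(x),\dots,T_M(x)$ with common variance $\sigma^2(x)$ and common pairwise covariance $c(x)$ (the notation is illustrative and not necessarily that of the paper). For the forest predictor $\hat f_M(x) = M^{-1}\sum_{m=1}^M T_m(x)$,
\[
\operatorname{Var}\!\bigl(\hat f_M(x)\bigr)
  \;=\; \underbrace{\frac{\sigma^2(x) - c(x)}{M}}_{\text{finite aggregation}}
  \;+\; \underbrace{c(x)}_{\text{structural dependence}}
  \;\xrightarrow{\;M\to\infty\;}\; c(x),
\]
so the first term vanishes as trees are added, while the pairwise covariance survives as the floor that infinite aggregation alone cannot remove.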