Random forests are widely used prediction procedures, yet are typically described algorithmically rather than as statistical designs acting on a fixed set of covariates. We develop a finite-sample, design-based formulation of random forests in which each tree is an explicit randomized conditional regression function. This perspective yields an exact variance identity for the forest predictor that separates finite-aggregation variability from a structural dependence term that persists even under infinite aggregation. We further decompose both single-tree dispersion and inter-tree covariance using the laws of total variance and covariance, isolating two fundamental design mechanisms: reuse of training observations and alignment of data-adaptive partitions. These mechanisms induce a strict covariance floor, demonstrating that predictive variability cannot be eliminated by increasing the number of trees alone. The resulting framework clarifies how resampling, feature-level randomization, and split selection govern resolution, tree variability, and dependence, and establishes random forests as explicit finite-sample statistical designs whose behavior is determined by their underlying randomized construction.
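The covariance floor described above can be illustrated with a minimal Monte Carlo sketch. Under the standard exchangeability assumption, a forest of B trees satisfies Var(forest) = (1/B) Var(T) + (1 - 1/B) Cov(T, T'), which converges to the inter-tree covariance as B grows. The model below is a hypothetical stand-in, not the paper's construction: each tree prediction is decomposed as a shared structural component S (representing reuse of training observations and partition alignment) plus independent per-tree noise e_b, with assumed component scales sigma_S and sigma_e.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-component model of exchangeable tree predictions:
#   T_b = S + e_b, with S shared across trees and e_b independent.
# Then Cov(T_b, T_b') = Var(S), which is the covariance floor.
sigma_S, sigma_e = 1.0, 2.0   # assumed scales of the two components
n_rep = 200_000               # Monte Carlo replicates

def forest_variance(B: int) -> float:
    """Empirical variance of a B-tree forest prediction."""
    S = rng.normal(0.0, sigma_S, size=n_rep)        # shared structural term
    e = rng.normal(0.0, sigma_e, size=(n_rep, B))   # tree-specific noise
    preds = S[:, None] + e                          # B tree predictions per replicate
    return float(preds.mean(axis=1).var())

var_tree = sigma_S**2 + sigma_e**2   # single-tree variance Var(T)
cov_floor = sigma_S**2               # inter-tree covariance Cov(T, T')

for B in (1, 10, 100):
    theory = var_tree / B + (1.0 - 1.0 / B) * cov_floor
    print(f"B={B:4d}  empirical={forest_variance(B):.4f}  identity={theory:.4f}")
```

As B increases, both columns approach the floor Var(S) = 1.0 rather than zero, matching the claim that aggregation alone cannot remove the structural dependence term.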