Random Forests as Statistical Procedures: Design, Variance, and Dependence

Source: arXiv. 55 pages (35-page main text; 20-page supplement); 10 figures (9 in main text; 1 in supplement). Version 2: added the procedure-aligned synthetic resampling (PASR) estimation framework, pointwise prediction and confidence intervals, and comprehensive simulations validating the theoretical claims.

We develop a finite-sample, design-based theory for random forests in which each tree is a randomized conditional predictor acting on fixed covariates and the forest is their Monte Carlo average. An exact variance identity separates Monte Carlo error from a covariance floor that persists under infinite aggregation. The floor arises through two mechanisms: observation reuse, where the same training outcomes receive weight across multiple trees, and partition alignment, where independently generated trees discover similar conditional prediction rules. We prove the floor is strictly positive under minimal conditions and show that alignment persists even when sample splitting eliminates observation overlap entirely. We introduce procedure-aligned synthetic resampling (PASR) to estimate the covariance floor, decomposing the total prediction uncertainty of a deployed forest into interpretable components. For continuous outcomes, resulting prediction intervals achieve nominal coverage with a theoretically guaranteed conservative bias direction. For classification forests, the PASR estimator is asymptotically unbiased, providing the first pointwise confidence intervals for predicted conditional probabilities from a deployed forest. Nominal coverage is maintained across a range of design configurations for both outcome types, including high-dimensional settings. The underlying theory extends to any tree-based ensemble with an exchangeable tree-generating mechanism.
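The split between Monte Carlo error and a covariance floor can be illustrated with the standard variance formula for an average of exchangeable random variables. If each of B trees has prediction variance sigma2 at a fixed point and any two distinct trees have covariance cov, the forest average has variance cov + (sigma2 - cov) / B, so the second term vanishes as B grows while cov remains. The sketch below checks this numerically in a toy Gaussian model. It is not the paper's PASR estimator, and the function names, the shared-plus-idiosyncratic construction, and the parameter values are all assumptions made for illustration.

```python
import random
import statistics


def forest_variance(sigma2, cov, n_trees):
    """Variance of the average of n_trees exchangeable tree predictions.

    sigma2: variance of a single tree's prediction (hypothetical value)
    cov:    covariance between two distinct trees, i.e. the floor
    """
    return cov + (sigma2 - cov) / n_trees


def simulate_forest_variance(sigma2, cov, n_trees, n_reps, seed=0):
    """Empirical variance of the forest prediction in a toy Gaussian model.

    Assumption: each tree prediction is shared + idiosyncratic, where
    `shared` is drawn once per replication (variance cov) and the
    idiosyncratic part is drawn once per tree (variance sigma2 - cov).
    """
    rng = random.Random(seed)
    shared_sd = cov ** 0.5
    idio_sd = (sigma2 - cov) ** 0.5
    forest_preds = []
    for _ in range(n_reps):
        shared = rng.gauss(0.0, shared_sd)
        trees = [shared + rng.gauss(0.0, idio_sd) for _ in range(n_trees)]
        forest_preds.append(sum(trees) / n_trees)
    return statistics.pvariance(forest_preds)


if __name__ == "__main__":
    sigma2, cov = 1.0, 0.3  # illustrative values, not taken from the paper
    for n_trees in (1, 10, 100):
        exact = forest_variance(sigma2, cov, n_trees)
        approx = simulate_forest_variance(sigma2, cov, n_trees, n_reps=5000)
        print(n_trees, round(exact, 4), round(approx, 4))
```

In this toy model the shared component stands in for whatever makes trees agree, whether observation reuse or partition alignment. Adding trees shrinks only the idiosyncratic term, which is why the simulated variance levels off near cov rather than at zero.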
