Sequential Bootstrap for Out-of-Bag Error Estimation: A 100-Seed Replication Study and Variance-Structure Analysis

arXiv preprint. 22 pages, 9 tables, 1 appendix. v2: replication budget extended from 3 to 100 seeds; statistical analyses re-derived under cross-seed paired tests; Section 5 entirely rewritten; new Section 6.3 and Appendix A document the 3-seed vs 100-seed comparison. Code and data: https://github.com/Cheng-Peng0718/SB-OOB-100seed

Out-of-Bag (OOB) estimation is the standard internal diagnostic for bootstrap-aggregated tree ensembles. Under the classical multinomial bootstrap, the number of distinct training observations in each replicate, $U_b$, is itself random, but its contribution to OOB-based variability has rarely been isolated empirically. We use Sequential Bootstrap (SB) -- a resampling scheme that holds $U_b$ at a fixed target $k_n = \lfloor 0.632 n\rfloor$ -- as a controlled perturbation of the bootstrap mechanism, and ask whether stabilizing $U_b$ produces any measurable change in OOB-based diagnostics. We reproduce Breiman's five OOB experimental families on twelve synthetic and real datasets, but unlike the three-seed presentation common in this literature, we run 100 independent random seeds with 50 internal replications per seed, enabling formal paired statistical comparison (Wilcoxon signed-rank, paired-$t$, Pitman--Morgan variance test). We report three findings. First, OOB means are essentially insensitive to stabilization of $U_b$: of 57 (experiment, dataset, metric) cells under 100 seeds, only 6 reach $p<0.05$ on the paired mean comparison, and 4 of those 6 point in the opposite direction from what a 3-seed reading would suggest. Second, a narrow but reproducible effect survives at the variance level: SB reduces the cross-seed standard deviation of node-level classification diagnostics on real datasets while slightly increasing it on synthetic ones (permutation $p=0.026$); the Vehicle dataset exhibits a 21% cross-seed sd reduction (Pitman--Morgan $p=0.017$). Third, several directional claims that appear stable across three seeds flip sign under 100-seed replication, illustrating the cost of underpowered replication protocols. We therefore treat SB as a diagnostic tool for probing the distinct-sample-count term in the variance of OOB estimators, not as an alternative to the classical bootstrap.
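
To make the contrast between the two resampling schemes concrete, the following is a minimal sketch, not the authors' implementation. Under the classical bootstrap the expected fraction of distinct observations is $1-(1-1/n)^n \approx 0.632$ for large $n$, which is the value the target $k_n$ pins down. The stopping rule used here (draw with replacement until exactly $k_n$ distinct indices have appeared) is an assumption about how SB is realized, and the function names `classical_bootstrap`, `sequential_bootstrap`, and `oob_indices` are hypothetical.

```python
import math
import random


def classical_bootstrap(n, rng):
    """Multinomial bootstrap: n draws with replacement, so U_b is random."""
    return [rng.randrange(n) for _ in range(n)]


def sequential_bootstrap(n, rng, frac=0.632):
    """Assumed SB stopping rule: draw with replacement until exactly
    k_n = floor(frac * n) distinct indices have been seen."""
    k_n = math.floor(frac * n)
    seen, draws = set(), []
    while len(seen) < k_n:
        i = rng.randrange(n)
        draws.append(i)
        seen.add(i)
    return draws


def oob_indices(n, draws):
    """Out-of-bag set: indices that never appear in the replicate."""
    in_bag = set(draws)
    return [i for i in range(n) if i not in in_bag]


rng = random.Random(0)
n = 200

# Distinct-sample counts U_b over 50 replicates of each scheme.
u_classical = [len(set(classical_bootstrap(n, rng))) for _ in range(50)]
u_sequential = [len(set(sequential_bootstrap(n, rng))) for _ in range(50)]

print("classical  U_b range:", min(u_classical), "to", max(u_classical))
print("sequential U_b range:", min(u_sequential), "to", max(u_sequential))
```

In this sketch the sequential scheme produces the same number of distinct in-bag observations, and therefore the same OOB set size, in every replicate, while the classical scheme lets both vary. That fixed-versus-random $U_b$ is the controlled perturbation the abstract describes.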
