Tree-based methods such as Random Forests are learning algorithms that have become an integral part of the statistical toolbox. The last decade has shed some light on their theoretical properties, such as their consistency for regression tasks. However, the usual proofs assume normal error terms as well as an additive regression function, and they are rather technical. We overcome these issues by introducing a simple and intuitive technique for proving consistency under quite general assumptions. To this end, we introduce a new class of naive trees, which perform the subspacing completely at random and independently of the data, and we give a direct proof of their consistency. By using them to bound the error of more complex tree-based approaches such as univariate and multivariate CART, Extremely Randomized Trees, and Random Forests, we deduce the consistency of all of these methods. Since naive trees appear to be too simple for practical application, we further analyze their finite-sample properties in a simulation and small benchmark study. We find a slow convergence speed and rather poor predictive performance. Based on these results, we finally discuss to what extent consistency proofs help to justify the application of complex learning algorithms.
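The abstract does not specify the construction of naive trees in detail; the following is a hypothetical minimal sketch, assuming that each split picks a coordinate and a split point uniformly at random within the current cell (so the partition is independent of the data), and that a leaf predicts the average response of the training points falling into it. All function names and the empty-leaf fallback to the global mean are illustrative assumptions, not the paper's method.

```python
import random

def build_naive_tree(bounds, depth, rng):
    """Recursively partition the cell `bounds` (one (lo, hi) pair per
    coordinate) by drawing a split coordinate and a split point uniformly
    at random -- the training data never influences the partition."""
    if depth == 0:
        return {"leaf": True, "bounds": [tuple(b) for b in bounds]}
    j = rng.randrange(len(bounds))        # split coordinate, chosen at random
    lo, hi = bounds[j]
    s = rng.uniform(lo, hi)               # split point, chosen at random
    left = [list(b) for b in bounds]
    right = [list(b) for b in bounds]
    left[j][1] = s
    right[j][0] = s
    return {"leaf": False, "coord": j, "split": s,
            "left": build_naive_tree(left, depth - 1, rng),
            "right": build_naive_tree(right, depth - 1, rng)}

def predict(tree, X, y, x):
    """Route `x` to its leaf cell, then average the responses of the
    training points lying in that cell (data enters only at this stage).
    Falls back to the global mean for an empty leaf (an assumption here)."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["coord"]] <= tree["split"] else tree["right"]
    in_leaf = [yi for xi, yi in zip(X, y)
               if all(lo <= v <= hi
                      for v, (lo, hi) in zip(xi, tree["bounds"]))]
    return sum(in_leaf) / len(in_leaf) if in_leaf else sum(y) / len(y)
```

Because the partition is drawn before seeing any data, consistency arguments can treat the cells as fixed and only need the leaf averages to concentrate, which is what makes this class of trees a convenient bounding device for data-dependent variants.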