Classification and regression trees (CART), random forests (RF), and gradient boosting trees (GBT) are probably the most popular statistical learning methods. However, their statistical consistency can be proved only under very restrictive assumptions on the underlying regression function. As an extension of standard CART, the oblique decision tree (ODT), which uses linear combinations of predictors as partitioning variables, has received much attention. ODT tends to outperform CART in numerical experiments and requires fewer partitions. In this paper, we show that ODT is consistent for very general regression functions as long as they are continuous. We then prove the consistency of the ODT-based random forest (ODRF), whether fully grown or not. Finally, we propose an ensemble of GBT for regression by borrowing the technique of orthogonal matching pursuit and study its consistency under very mild conditions on the tree structure. We refine existing software packages according to the established theory, and extensive experiments on real data sets show that both our ensemble boosting trees and ODRF offer noticeable overall improvements over RF and other forests.
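To illustrate the difference between an axis-aligned (CART-style) partition and an oblique one, the following minimal sketch compares a single split along one coordinate with a single split along a linear combination of predictors. It is not the procedure analyzed in the paper. The function names, the toy data, and the small fixed set of candidate directions are all assumptions made for illustration; how the linear combinations are actually chosen is not specified in this abstract.

```python
import numpy as np

def best_threshold_sse(z, y):
    """Best single split of y along the 1-D values z.

    Returns (sse, threshold), where sse is the total within-node sum of
    squared errors after the split.
    """
    order = np.argsort(z)
    z_sorted, y_sorted = z[order], y[order]
    n = len(y)
    csum = np.cumsum(y_sorted)
    csum2 = np.cumsum(y_sorted ** 2)
    best_sse, best_thr = np.inf, None
    for i in range(1, n):  # left node holds the first i sorted points
        if z_sorted[i] == z_sorted[i - 1]:
            continue  # cannot place a threshold between equal values
        left_sse = csum2[i - 1] - csum[i - 1] ** 2 / i
        right_sum = csum[-1] - csum[i - 1]
        right_sse = (csum2[-1] - csum2[i - 1]) - right_sum ** 2 / (n - i)
        sse = left_sse + right_sse
        if sse < best_sse:
            best_sse = sse
            best_thr = 0.5 * (z_sorted[i - 1] + z_sorted[i])
    return best_sse, best_thr

def axis_aligned_split(X, y):
    """CART-style: each single predictor is a candidate partitioning variable."""
    results = [best_threshold_sse(X[:, j], y) + (j,) for j in range(X.shape[1])]
    return min(results, key=lambda r: r[0])

def oblique_split(X, y, directions):
    """ODT-style: each linear combination X @ w is a candidate (hypothetical direction set)."""
    results = [best_threshold_sse(X @ w, y) + (tuple(w),) for w in directions]
    return min(results, key=lambda r: r[0])

# Toy data (an assumption): the response changes across the diagonal x1 + x2 = 0.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Hypothetical candidate directions: the two axes plus two diagonals.
directions = [np.array(w, dtype=float) for w in [(1, 0), (0, 1), (1, 1), (1, -1)]]

axis_sse, axis_thr, axis_var = axis_aligned_split(X, y)
obl_sse, obl_thr, obl_dir = oblique_split(X, y, directions)
print("axis-aligned SSE:", axis_sse, "on predictor", axis_var)
print("oblique SSE:", obl_sse, "along direction", obl_dir)
```

On this toy problem one oblique split along the direction (1, 1) separates the two response levels, whereas any single axis-aligned split leaves residual error. This gives a small-scale picture of why an oblique tree can need fewer partitions.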