Linear model trees are regression trees that incorporate linear models in the leaf nodes. This preserves the intuitive interpretation of decision trees while enabling them to better capture linear relationships, which standard decision trees struggle to represent. However, most existing methods for fitting linear model trees are time-consuming and therefore do not scale to large data sets. In addition, linear model trees are more prone to overfitting and extrapolation issues than standard regression trees. In this paper we introduce PILOT, a new algorithm for linear model trees that is fast, regularized, stable and interpretable. PILOT trains in a greedy fashion like classic regression trees, but incorporates an $L^2$ boosting approach and a model selection rule for fitting linear models in the nodes. The abbreviation PILOT stands for PIecewise Linear Organic Tree, where 'organic' refers to the fact that no pruning is carried out. PILOT has the same low time and space complexity as CART without its pruning. An empirical study indicates that PILOT tends to outperform standard decision trees and other linear model trees on a variety of data sets. Moreover, we prove its consistency in an additive model setting under weak assumptions. When the data is generated by a linear model, the convergence rate is polynomial.
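To make concrete the claim that linear leaves capture linear relationships better than constant leaves, the following is a minimal sketch and not the PILOT algorithm itself: it greedily searches for a single split on one feature and compares the resulting error when each leaf holds a constant versus a least-squares line. The function names (`sse_constant`, `sse_linear`, `best_stump`), the `min_leaf` parameter and the synthetic data are illustrative assumptions; PILOT's $L^2$ boosting and model selection rule are not implemented here.

```python
import numpy as np


def sse_constant(y):
    """Sum of squared errors of a constant (mean) fit."""
    return float(np.sum((y - y.mean()) ** 2))


def sse_linear(x, y):
    """Sum of squared errors of a least-squares line fitted to (x, y)."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ coef) ** 2))


def best_stump(x, y, leaf="constant", min_leaf=5):
    """Greedy search for the one split that minimizes the total leaf SSE.

    `leaf` selects the model placed in each child: "constant" mimics a
    classic regression tree, "linear" mimics a linear model tree.
    """
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(min_leaf, len(x) - min_leaf + 1):
        if x[i] == x[i - 1]:
            continue  # no valid threshold between identical values
        if leaf == "constant":
            total = sse_constant(y[:i]) + sse_constant(y[i:])
        else:
            total = sse_linear(x[:i], y[:i]) + sse_linear(x[i:], y[i:])
        if total < best_sse:
            best_threshold = 0.5 * (x[i - 1] + x[i])
            best_sse = total
    return best_threshold, best_sse


# Hypothetical data: a piecewise linear signal with a kink at x = 0.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
y = np.where(x < 0, 1 + 2 * x, 1 - 3 * x) + rng.normal(0.0, 0.1, 200)

thr_c, sse_c = best_stump(x, y, leaf="constant")
thr_l, sse_l = best_stump(x, y, leaf="linear")
print(f"constant leaves: threshold={thr_c:.3f}, SSE={sse_c:.2f}")
print(f"linear leaves:   threshold={thr_l:.3f}, SSE={sse_l:.2f}")
```

The exhaustive refit at every candidate split is chosen for clarity only; it is exactly the kind of cost that makes naive linear model trees slow, which is the scalability issue the abstract says PILOT is designed to avoid.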