Evidence suggests that oblique splits can significantly enhance the performance of decision trees. This paper studies the optimization of high-dimensional oblique splits for decision tree construction, establishing a Sufficient Impurity Decrease (SID) convergence framework that accounts for $s_0$-sparse oblique splits. We show that the SID function class expands as the sparsity parameter $s_0$ increases, enabling the model to capture complex data-generating processes such as the $s_0$-dimensional XOR function; in this sense, $s_0$ quantifies the unknown complexity of the underlying data-generating function. Furthermore, we establish that learning these complex functions requires greater computational resources. This highlights a fundamental trade-off between statistical accuracy, governed by the $s_0$-dependent size of the SID function class, and computational cost. In particular, for challenging problems the required set of candidate oblique splits can become prohibitively large, rendering standard ensemble approaches computationally impractical. To address this, we propose progressive trees, which optimize oblique splits through an iterative refinement process rather than a single-step optimization. These splits are integrated, alongside traditional orthogonal splits, into ensemble models such as Random Forests to enhance finite-sample performance. The effectiveness of our approach is validated through simulations and real-data experiments, in which it consistently outperforms various existing oblique tree models.
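For concreteness, one standard form of the $s_0$-dimensional XOR (parity) function mentioned above, stated here as an illustrative assumption since the exact definition is not fixed in this summary, is
\[
f(\mathbf{x}) \;=\; \mathbb{1}\Bigl\{ \sum_{j \in S} \mathbb{1}\{x_j > 1/2\} \ \text{is odd} \Bigr\}, \qquad S \subseteq \{1, \dots, p\}, \quad |S| = s_0.
\]
For $s_0 \ge 2$ and a uniform design on $[0,1]^p$, every single coordinate is marginally independent of $f(\mathbf{x})$, so no orthogonal split achieves any impurity decrease, whereas a sparse oblique split such as $\mathbb{1}\{x_1 + x_2 \le 1/2\}$ (in the two-dimensional case) isolates a pure region and does; this is the sense in which enlarging $s_0$ enlarges the SID function class.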
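The abstract describes progressive trees only at a high level; the Python sketch below is a hypothetical illustration of one way an iterative refinement of a sparse oblique split could proceed, growing the split direction one coordinate per step rather than searching all sparse directions at once. The function names, the restriction of weights to $\{-1, +1\}$, and the variance-based impurity criterion are assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np

def impurity_decrease(z, y):
    """Best variance (MSE) impurity decrease over all thresholds
    on the one-dimensional projection z."""
    order = np.argsort(z)
    y_sorted = y[order]
    n = len(y)
    csum = np.cumsum(y_sorted)          # prefix sums -> left/right sums in O(n)
    total = csum[-1]
    best = 0.0
    for i in range(1, n):               # split between positions i-1 and i
        left, right = csum[i - 1], total - csum[i - 1]
        best = max(best, left**2 / i + right**2 / (n - i))
    return best - total**2 / n          # decrease relative to no split

def progressive_oblique_split(X, y, s0):
    """Greedily grow a sparse oblique direction w (weights in {-1, +1}),
    adding one coordinate per refinement step, up to s0 active coordinates."""
    n, p = X.shape
    w = np.zeros(p)
    z = np.zeros(n)                     # current projection X @ w
    best_dec = 0.0
    for _ in range(s0):
        step = (best_dec, None, None)
        for j in np.flatnonzero(w == 0):        # coordinates not yet active
            for sign in (+1.0, -1.0):
                dec = impurity_decrease(z + sign * X[:, j], y)
                if dec > step[0]:
                    step = (dec, j, sign)
        best_dec, j, sign = step
        if j is None:                   # no coordinate improves the split: stop
            break
        w[j] = sign
        z = z + sign * X[:, j]
    return w, best_dec
```

Under these assumptions, each refinement step scans the remaining coordinates once, so the total cost is $O(s_0\, p\, n \log n)$, linear rather than combinatorial in the number of candidate directions. Note, however, that a purely greedy rule can stall on exact parity signals at the population level (no single coordinate helps marginally), so this sketch illustrates only the per-step structure of an iterative refinement, not a complete procedure.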