The study of optimal decision trees has gained increasing attention in recent years. Despite substantial progress, the field still faces two major challenges. First, trees constructed by existing optimal decision tree (ODT) algorithms have limited expressivity, as they are typically restricted to axis-parallel splits or binary features. Second, these algorithms generally do not scale well to large datasets. The two challenges are intertwined: more expressive splitting rules incur significantly higher combinatorial complexity, which makes the ODT problem even harder to solve. Building on He and Little's proper decision tree framework, we propose the first algorithm for solving the optimal hypersurface decision tree problem, with time complexity $O\left(K!\times N^{DG+G}\right)$, where $G$ depends on $K$ (tree size), $M$ (polynomial degree of the hypersurface), and $D$ (data dimension). To the best of our knowledge, no previously known algorithm is capable of producing decision trees with hypersurface splits. Moreover, the proposed algorithm is inherently amenable to vectorization, enabling efficient parallelization. Its generic design pattern also allows it to accelerate other ODT variants, such as axis-parallel decision trees. Furthermore, we identify an effective pruning strategy for the optimal hypersurface decision tree problem, which enables our algorithm to run significantly faster than the worst-case upper bound suggests, together with an incremental procedure that reduces the cost of checking the feasibility of a single configuration from quadratic to linear time.
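To make the notion of a hypersurface split concrete, the following is a minimal sketch, not the algorithm proposed here. It assumes that a degree-$M$ polynomial split can be evaluated as a linear split after mapping each point to its monomials of total degree 1 through $M$. The function names `monomial_embedding` and `hypersurface_split` are hypothetical, and the embedding dimension used below is an illustration only; it should not be read as the definition of $G$ in the complexity bound.

```python
from itertools import combinations_with_replacement

import numpy as np


def monomial_embedding(X, M):
    """Map each row of X (shape N x D) to all monomials of total degree 1..M.

    Hypothetical helper for illustration; the column order is degree by
    degree, with index tuples in lexicographic order within each degree.
    """
    _, D = X.shape
    columns = []
    for degree in range(1, M + 1):
        for idx in combinations_with_replacement(range(D), degree):
            # Repeated indices give powers, e.g. (0, 0) -> x_0 ** 2.
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.stack(columns, axis=1)


def hypersurface_split(X, weights, bias, M):
    """Return True for points on the non-negative side of the polynomial
    hypersurface  weights . monomials(x) + bias = 0  (assumed convention)."""
    return monomial_embedding(X, M) @ weights + bias >= 0


# Example: the unit circle x^2 + y^2 - 1 = 0 in D = 2 with M = 2.
# Column order is (x, y, x^2, x*y, y^2).
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.5, 0.5]])
weights = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
outside = hypersurface_split(X, weights, bias=-1.0, M=2)
```

In this example the first and third points fall inside the circle and the second falls outside, which no single axis-parallel split can express. Setting $M = 1$ reduces the sketch to an ordinary hyperplane split.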