Decision trees are a classic model for summarizing and classifying data. To enhance interpretability and generalization properties, it has been proposed to favor small decision trees. Accordingly, in the minimum-size decision tree training problem (MSDT), the input is a set of training examples in $\mathbb{R}^d$ with class labels, and the aim is to find a decision tree that classifies all training examples correctly and has a minimum number of nodes. MSDT is NP-hard and therefore presumably not solvable in polynomial time. Nevertheless, Komusiewicz et al. [ICML '23] developed a promising algorithmic paradigm called witness trees, which solves MSDT efficiently if the solution tree is small. In this work, we test this paradigm empirically. We provide an implementation, augment it with extensive heuristic improvements, and scrutinize it on standard benchmark instances. The augmentations achieve a mean 324-fold (median 84-fold) speedup over the naive implementation. Compared to the state of the art, they achieve a mean 32-fold (median 7-fold) speedup over the dynamic-programming-based MurTree solver [Demirovi\'c et al., J. Mach. Learn. Res. '22] and a mean 61-fold (median 25-fold) speedup over SAT-based implementations [Janota and Morgado, SAT '20]. As a theoretical result, we obtain an improved worst-case running-time bound for MSDT.
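To make the MSDT problem statement concrete, the following is a minimal brute-force sketch (not the witness-tree algorithm from the paper): it exhaustively searches axis-aligned threshold splits to find the minimum number of inner nodes in a tree that classifies every training example correctly. All function names here are illustrative assumptions, and the exponential-time search is only viable on toy inputs.

```python
from functools import lru_cache

def min_tree_size(X, y):
    """Minimum number of inner nodes (axis-aligned cuts) of a
    decision tree classifying all examples in X with labels y
    correctly. Toy brute force for intuition only."""
    d = len(X[0])

    @lru_cache(maxsize=None)
    def solve(idx):
        pts = list(idx)
        labels = {y[i] for i in pts}
        if len(labels) == 1:      # pure subset: a single leaf, 0 cuts
            return 0
        best = float("inf")
        for f in range(d):
            vals = sorted({X[i][f] for i in pts})
            # candidate thresholds: midpoints between consecutive values
            for lo, hi in zip(vals, vals[1:]):
                thr = (lo + hi) / 2
                left = frozenset(i for i in pts if X[i][f] <= thr)
                right = frozenset(i for i in pts if X[i][f] > thr)
                best = min(best, 1 + solve(left) + solve(right))
        return best

    return solve(frozenset(range(len(X))))

# Usage: an XOR-style labeling in 2D needs three cuts, since no single
# axis-aligned split separates the classes.
print(min_tree_size([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 1, 1, 0]))
```

The per-subset memoization keeps the sketch short, but the search space is still exponential in the number of examples; the witness-tree paradigm evaluated in this work instead bounds the running time by the size of the solution tree.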