Regression trees are one of the oldest forms of AI models, and their predictions can be made without a calculator, which makes them broadly useful, particularly for high-stakes applications. Within the large literature on regression trees, there has been little effort towards full provable optimization, mainly due to the computational hardness of the problem. This work proposes a dynamic-programming-with-bounds approach to the construction of provably optimal sparse regression trees. We leverage a novel lower bound based on an optimal solution to the k-Means clustering problem in one dimension over the set of labels. We are often able to find optimal sparse trees in seconds, even for challenging datasets that involve large numbers of samples and highly correlated features.
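The intuition behind a k-Means-based lower bound can be sketched as follows: a regression tree with k leaves splits the samples into k groups and predicts each group's mean label, so its squared error can never fall below the cost of the best unconstrained grouping of the labels into k clusters. In one dimension that best grouping always consists of contiguous runs of the sorted labels, so it can be found exactly by dynamic programming. The Python sketch below illustrates this idea only. The function name `kmeans_1d_cost` and the simple O(k n^2) recurrence are assumptions made for illustration, not the authors' implementation or exact bound.

```python
from itertools import accumulate


def kmeans_1d_cost(values, k):
    """Minimum total squared error when splitting `values` into at most k clusters.

    Illustrative helper (hypothetical name); uses a plain O(k * n^2) dynamic program.
    """
    ys = sorted(values)
    n = len(ys)
    if n == 0:
        return 0.0
    k = min(k, n)  # more clusters than points cannot help

    # Prefix sums give the squared error of any contiguous run in O(1).
    s1 = [0.0] + list(accumulate(ys))
    s2 = [0.0] + list(accumulate(y * y for y in ys))

    def sse(i, j):
        # Squared error of ys[i:j] around its own mean (requires i < j).
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    # best[j]: optimal cost of the first j sorted values using the current
    # number of clusters (starting with one cluster).
    best = [0.0] + [sse(0, j) for j in range(1, n + 1)]
    for c in range(2, k + 1):
        new = [float("inf")] * (n + 1)
        for j in range(c, n + 1):
            # The last cluster is ys[i:j]; the first i values use c - 1 clusters.
            new[j] = min(best[i] + sse(i, j) for i in range(c - 1, j))
        best = new

    # Clamp tiny negative values caused by floating-point rounding.
    return max(best[n], 0.0)
```

Under these assumptions, `kmeans_1d_cost(labels, k)` is a value that no tree with k leaves can beat on squared error over `labels`, which is the kind of quantity a branch-and-bound search can use to discard subproblems early.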