Decision tree learning has long been a central topic in theoretical computer science, driven by its practical importance. A fundamental and widely used method for decision tree construction is the top-down greedy heuristic, which recursively splits on the most influential variable. Despite its empirical success, theoretical analysis of this heuristic has been limited. A recent breakthrough by Blanc et al. (ITCS, 2020) provided the first rigorous theoretical guarantees for the greedy approach, but only under the uniform distribution. We extend this analysis to the more general and practically relevant setting of arbitrary product distributions. Our main result shows that for any function $f$ computable by an optimal decision tree of size $s$, maximum depth $D_{\text{opt}}$, and average depth $\Delta_{\text{opt}}$, the greedy heuristic constructs an $\varepsilon$-approximating tree whose size grows at most as $\exp\bigl(\Delta_{\text{opt}} D_{\text{opt}} \log(e/\varepsilon)\bigr)$. In the special case where the optimal tree is a full binary tree, this bound improves upon the bound of Blanc et al. and holds under a strictly broader class of distributions. Moreover, we present an algorithm based on the top-down greedy heuristic that is entirely parameter-free: it requires no prior knowledge of the optimal tree's size or depth, which offers a practical advantage over Blanc et al.'s method.
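To make the top-down greedy heuristic concrete, the following is a minimal Python sketch of splitting on the most influential variable under a product distribution. It is an illustration under stated assumptions, not the algorithm analyzed in the paper. All function names (`weight`, `influence`, `greedy_tree`, and so on) are hypothetical. The influence of $x_i$ is taken here to be the probability, over the other coordinates, that flipping $x_i$ changes $f$; the abstract does not state the paper's exact definition. The sketch also stops at a fixed depth budget `max_depth`, which is a simplification and not the parameter-free stopping rule mentioned above. Everything is computed by exhaustive enumeration, so it is only practical for a handful of variables.

```python
from itertools import product


def consistent_inputs(n, restriction):
    """Yield every x in {0,1}^n that agrees with the partial assignment."""
    free = [i for i in range(n) if i not in restriction]
    for bits in product((0, 1), repeat=len(free)):
        x = [0] * n
        for i, b in restriction.items():
            x[i] = b
        for i, b in zip(free, bits):
            x[i] = b
        yield tuple(x)


def weight(x, p, restriction):
    """Probability of x over the free coordinates, where Pr[x_i = 1] = p[i]."""
    w = 1.0
    for i, (xi, pi) in enumerate(zip(x, p)):
        if i not in restriction:
            w *= pi if xi else 1.0 - pi
    return w


def influence(f, p, i, restriction):
    """Assumed definition: probability that flipping x_i changes f."""
    total = 0.0
    for x in consistent_inputs(len(p), restriction):
        x0 = x[:i] + (0,) + x[i + 1:]
        x1 = x[:i] + (1,) + x[i + 1:]
        if f(x0) != f(x1):
            total += weight(x, p, restriction)
    return total


def greedy_tree(f, p, max_depth, restriction=None):
    """Recursively split on the most influential free variable.

    Internal nodes are tuples (variable, subtree_if_0, subtree_if_1);
    leaves are the labels 0 or 1.
    """
    restriction = restriction or {}
    n = len(p)
    free = [i for i in range(n) if i not in restriction]
    scores = {i: influence(f, p, i, restriction) for i in free}
    if max_depth == 0 or not free or max(scores.values()) == 0.0:
        # Leaf: predict the label carrying at least half the conditional mass.
        mass_one = sum(weight(x, p, restriction)
                       for x in consistent_inputs(n, restriction) if f(x) == 1)
        return 1 if mass_one >= 0.5 else 0
    best = max(free, key=lambda i: scores[i])
    return (best,
            greedy_tree(f, p, max_depth - 1, {**restriction, best: 0}),
            greedy_tree(f, p, max_depth - 1, {**restriction, best: 1}))


def evaluate(tree, x):
    """Follow the path selected by x down to a leaf label."""
    while isinstance(tree, tuple):
        var, low, high = tree
        tree = high if x[var] else low
    return tree


def error(tree, f, p):
    """Probability under the product distribution that the tree disagrees with f."""
    return sum(weight(x, p, {})
               for x in consistent_inputs(len(p), {}) if evaluate(tree, x) != f(x))


if __name__ == "__main__":
    # Hypothetical target function and biases, chosen only for illustration.
    f = lambda x: x[0] & (x[1] | x[2])
    p = [0.5, 0.9, 0.2]
    for depth in (1, 3):
        tree = greedy_tree(f, p, depth)
        print(depth, tree, error(tree, f, p))
```

In this toy example the biases matter: because $x_1$ is almost always 1, the variable $x_0$ is pivotal with high probability and is chosen at the root, and a depth-1 tree already approximates $f$ closely. An $\varepsilon$-approximating tree in the sense of the abstract corresponds to `error(tree, f, p)` being at most $\varepsilon$.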