We prove that it is NP-hard to properly PAC learn decision trees with queries, resolving a longstanding open problem in learning theory (Bshouty 1993; Guijarro-Lavin-Raghavan 1999; Mehta-Raghavan 2002; Feldman 2016). While a long line of work, dating back to (Pitt-Valiant 1988), has established the hardness of properly learning decision trees from random examples, the more challenging setting of query learners necessitates different techniques, and no lower bounds were previously known. En route to our main result, we simplify and strengthen the best known lower bounds for a different problem, Decision Tree Minimization (Zantema-Bodlaender 2000; Sieling 2003). On a technical level, we introduce the notion of hardness distillation, which we study for decision tree complexity but which can be considered for any complexity measure: for a function that requires large decision trees, we give a general method for identifying a small set of inputs that is responsible for its complexity. Our technique even rules out query learners that are allowed constant error. This contrasts with existing lower bounds for the setting of random examples, which hold only for inverse-polynomial error. Our result, taken together with a recent almost-polynomial time query algorithm for properly learning decision trees under the uniform distribution (Blanc-Lange-Qiao-Tan 2022), demonstrates the dramatic impact of distributional assumptions on the problem.