This paper advances the theoretical understanding of the label complexity of active learning for decision trees as binary classifiers. We make two main contributions. First, we provide the first analysis of the disagreement coefficient for decision trees, a key parameter governing the label complexity of active learning. Our analysis holds under two natural assumptions required for achieving polylogarithmic label complexity: (i) each root-to-leaf path queries distinct feature dimensions, and (ii) the input data has a regular, grid-like structure. We show these assumptions are essential: relaxing either one leads to polynomial label complexity. Second, we present the first general active learning algorithm for binary classification that achieves a multiplicative error guarantee, producing a $(1+ε)$-approximate classifier. Combining these results, we design an active learning algorithm for decision trees that, under the stated assumptions, uses a number of label queries only polylogarithmic in the dataset size. Finally, we establish a label complexity lower bound showing that our algorithm's dependence on the error tolerance $ε$ is close to optimal.