We give the first algorithm that maintains an approximate decision tree over an arbitrary sequence of insertions and deletions of labeled examples, with strong guarantees on the worst-case running time per update request. For instance, we show how to maintain a decision tree where every vertex has Gini gain within an additive $\alpha$ of the optimum by performing $O\Big(\frac{d\,(\log n)^4}{\alpha^3}\Big)$ elementary operations per update, where $d$ is the number of features and $n$ the maximum size of the active set (the net result of the update requests). We give similar bounds for the information gain and the variance gain. In fact, all these bounds are corollaries of a more general result, stated in terms of decision rules -- functions that, given a set $S$ of labeled examples, decide whether to split $S$ or predict a label. Decision rules give a unified view of greedy decision tree algorithms regardless of the example and label domains, and lead to a general notion of $\epsilon$-approximate decision trees that, for natural decision rules such as those used by ID3 or C4.5, implies the gain approximation guarantees above. The heart of our work provides a deterministic algorithm that, given any decision rule and any $\epsilon > 0$, maintains an $\epsilon$-approximate tree using $O\!\left(\frac{d\, f(n)}{n} \operatorname{poly}\frac{h}{\epsilon}\right)$ operations per update, where $f(n)$ is the complexity of evaluating the rule over a set of $n$ examples and $h$ is the maximum height of the maintained tree.
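To make the Gini gain guarantee concrete, the following is a minimal static sketch, not the dynamic algorithm described above. It assumes numeric features and threshold splits of the form $x_j \le t$, which the abstract does not specify. It computes the Gini gain of a split and checks whether that gain is within an additive $\alpha$ of the best gain achievable on the same set. All function names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(examples, feature, threshold):
    """Gini gain of splitting on x[feature] <= threshold.

    `examples` is a non-empty list of (x, y) pairs, where x is a tuple of
    numeric features (an assumption of this sketch) and y is a label.
    """
    left = [y for x, y in examples if x[feature] <= threshold]
    right = [y for x, y in examples if x[feature] > threshold]
    n = len(examples)
    parent = gini([y for _, y in examples])
    return parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)


def best_gini_gain(examples):
    """Brute-force optimum over all features and observed thresholds."""
    d = len(examples[0][0])
    best = 0.0
    for j in range(d):
        for t in {x[j] for x, _ in examples}:
            best = max(best, gini_gain(examples, j, t))
    return best


def is_alpha_approximate_split(examples, feature, threshold, alpha):
    """True if the split's gain is within an additive alpha of the optimum."""
    return gini_gain(examples, feature, threshold) >= best_gini_gain(examples) - alpha


if __name__ == "__main__":
    data = [((0.0, 1.0), "a"), ((1.0, 1.0), "a"), ((2.0, 0.0), "b"), ((3.0, 0.0), "b")]
    print(gini_gain(data, 0, 1.0))                        # gain of the perfect split
    print(is_alpha_approximate_split(data, 0, 0.0, 0.1))  # a poor split, small alpha
```

The brute-force search in `best_gini_gain` recomputes everything from scratch, so it is far more expensive than the per-update bound stated above; it serves only to spell out what "within an additive $\alpha$ of the optimum" means at a single vertex.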