We give the first algorithm that maintains an approximate decision tree over an arbitrary sequence of insertions and deletions of labeled examples, with strong guarantees on the worst-case running time per update request. For instance, we show how to maintain a decision tree where every vertex has Gini gain within an additive $\alpha$ of the optimum by performing $O\Big(\frac{d\,(\log n)^4}{\alpha^3}\Big)$ elementary operations per update, where $d$ is the number of features and $n$ is the maximum size of the active set (the net result of the update requests). We give similar bounds for the information gain and the variance gain. In fact, all these bounds are corollaries of a more general result, stated in terms of decision rules: functions that, given a set $S$ of labeled examples, decide whether to split $S$ or predict a label. Decision rules give a unified view of greedy decision tree algorithms regardless of the example and label domains, and lead to a general notion of $\epsilon$-approximate decision trees that, for natural decision rules such as those used by ID3 or C4.5, implies the gain approximation guarantees above. At the heart of our work is a deterministic algorithm that, given any decision rule and any $\epsilon > 0$, maintains an $\epsilon$-approximate tree using $O\!\left(\frac{d\, f(n)}{n} \operatorname{poly}\frac{h}{\epsilon}\right)$ operations per update, where $f(n)$ is the complexity of evaluating the rule over a set of $n$ examples and $h$ is the maximum height of the maintained tree.
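To make the guarantee "Gini gain within an additive $\alpha$ of the optimum" concrete, the following minimal Python sketch checks it at a single vertex by brute force. It is not the dynamic algorithm described above: it recomputes everything from scratch and says nothing about update time. It assumes the textbook definitions (Gini impurity $1 - \sum_k p_k^2$, and gain equal to the parent impurity minus the size-weighted impurities of the two children) and threshold splits on numeric features; the paper's exact definitions may differ. All function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of labels (0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(examples, feature, threshold):
    """Gain of splitting a non-empty list of (x, y) pairs on x[feature] <= threshold.

    Assumption: x is a tuple of numeric features and y is a hashable label.
    """
    left = [y for x, y in examples if x[feature] <= threshold]
    right = [y for x, y in examples if x[feature] > threshold]
    n = len(examples)
    parent = gini([y for _, y in examples])
    return parent - len(left) / n * gini(left) - len(right) / n * gini(right)


def best_gain(examples):
    """Optimum gain over all features and all thresholds seen in the data."""
    d = len(examples[0][0])
    return max(
        gini_gain(examples, j, x[j]) for j in range(d) for x, _ in examples
    )


def is_alpha_approximate(examples, feature, threshold, alpha):
    """True if the given split is within an additive alpha of the optimum gain."""
    return gini_gain(examples, feature, threshold) >= best_gain(examples) - alpha


# Hypothetical toy data: feature 0 separates the labels, feature 1 does not.
examples = [
    ((0.0, 1.0), "a"),
    ((1.0, 0.0), "a"),
    ((2.0, 1.0), "b"),
    ((3.0, 0.0), "b"),
]
good = is_alpha_approximate(examples, feature=0, threshold=1.0, alpha=0.1)
bad = is_alpha_approximate(examples, feature=1, threshold=0.0, alpha=0.1)
```

On this toy data the split on feature 0 attains the optimum gain, while the split on feature 1 has zero gain and so fails the check for $\alpha = 0.1$. The brute-force search costs on the order of $d\,n^2$ operations per vertex, which is the kind of full recomputation a dynamic algorithm is meant to avoid.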