Random forests and, more generally, (decision-)tree ensembles are widely used methods for classification and regression. Recent algorithmic advances make it possible to compute decision trees that are optimal with respect to measures such as their size or depth. We are not aware of such research for tree ensembles and aim to contribute to this area. Our main contributions are two novel algorithms and corresponding lower bounds. First, we carry over and substantially improve on tractability results for decision trees, obtaining a $(6\delta D S)^S \cdot \mathrm{poly}$-time algorithm, where $S$ is the number of cuts in the tree ensemble, $D$ is the largest domain size, and $\delta$ is the largest number of features in which two examples differ. To achieve this, we introduce the witness-tree technique, which also seems promising for practice. Second, we show that dynamic programming, which has been successful for decision trees, may also be viable for tree ensembles, providing an $\ell^n \cdot \mathrm{poly}$-time algorithm, where $\ell$ is the number of trees and $n$ is the number of examples. Finally, we compare the number of cuts necessary to classify training data sets for decision trees and tree ensembles, showing that ensembles may need exponentially fewer cuts as the number of trees increases.
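To make the parameters in the two running-time bounds concrete, the following minimal Python sketch computes $D$ and $\delta$ for a small toy data set and evaluates the exponential factors $(6\delta D S)^S$ and $\ell^n$. It illustrates the parameters only and is not the algorithms themselves. The function names and the toy data are hypothetical, and the sketch assumes that the domain size of a feature can be read off as the number of distinct values it takes in the data.

```python
from itertools import combinations


def largest_domain_size(examples):
    """D: largest number of distinct values taken by any single feature.

    Assumption: the domain of a feature is the set of values observed in the data.
    """
    num_features = len(examples[0])
    return max(len({ex[i] for ex in examples}) for i in range(num_features))


def max_pairwise_difference(examples):
    """delta: largest number of features in which two examples differ."""
    return max(
        (sum(a != b for a, b in zip(x, y)) for x, y in combinations(examples, 2)),
        default=0,  # a single example has no pair to compare
    )


def witness_tree_factor(delta, D, S):
    """Exponential factor (6 * delta * D * S)^S of the first bound."""
    return (6 * delta * D * S) ** S


def dynamic_programming_factor(num_trees, num_examples):
    """Exponential factor l^n of the second bound."""
    return num_trees ** num_examples


# Hypothetical toy data: four examples over three features.
examples = [(0, 1, 2), (0, 1, 0), (1, 1, 2), (0, 3, 2)]

D = largest_domain_size(examples)
delta = max_pairwise_difference(examples)

print("D =", D, "delta =", delta)
print("(6*delta*D*S)^S for S = 3:", witness_tree_factor(delta, D, 3))
print("l^n for l = 3 trees:", dynamic_programming_factor(3, len(examples)))
```

On data where examples differ in only a few features and domains are small, the first factor depends on the number of cuts $S$ alone rather than on the number of examples $n$, which is what distinguishes it from the $\ell^n$ bound.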