Decision trees are widely used for their low computational cost, good predictive performance, and ability to assess the importance of features. Though they are often used in practice for feature selection, their theoretical guarantees are not well understood. Here we obtain a tight finite sample bound for the feature selection problem in linear regression using single-depth decision trees. We examine the statistical properties of these "decision stumps" for the recovery of the $s$ active features from $p$ total features, where $s \ll p$. Our analysis provides tight sample performance guarantees on high-dimensional sparse systems that align with the finite sample bound of $O(s \log p)$ obtained by Lasso, improving upon previous bounds for both the median and optimal splitting criteria. Our results extend to the non-linear regime as well as to arbitrary sub-Gaussian distributions, demonstrating that tree-based methods attain strong feature selection properties under a wide variety of settings and shedding further light on the success of these methods in practice. As a byproduct of our analysis, we show that recovery can be provably guaranteed even when the number of active features $s$ is unknown. We further validate our theoretical results and proof methodology using computational experiments.
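To make the idea of feature selection with decision stumps concrete, the following is a minimal sketch, not the authors' exact procedure. It assumes a median split on each feature, scores each feature by the reduction in the variance of $y$ that the split produces, and keeps the $s$ highest-scoring features. The function names (`stump_scores`, `select_features`, `make_sparse_linear`) and the synthetic data settings are hypothetical and chosen only for illustration.

```python
import numpy as np


def stump_scores(X, y):
    """Score each feature by the variance reduction of a median-split stump.

    Assumption: the split threshold is the empirical median of the feature.
    """
    n, p = X.shape
    total_var = np.var(y)
    scores = np.zeros(p)
    for j in range(p):
        left = X[:, j] <= np.median(X[:, j])
        n_left = int(left.sum())
        n_right = n - n_left
        if n_left == 0 or n_right == 0:
            continue  # degenerate split: no impurity reduction
        # Weighted within-child variance after the split.
        within = (n_left * np.var(y[left]) + n_right * np.var(y[~left])) / n
        scores[j] = total_var - within
    return scores


def select_features(X, y, s):
    """Return the sorted indices of the s features with the largest scores."""
    scores = stump_scores(X, y)
    return np.sort(np.argsort(scores)[-s:])


def make_sparse_linear(n, p, s, noise=0.1, seed=0):
    """Synthetic sparse linear model: only the first s coefficients are nonzero.

    Hypothetical test data, not taken from the paper's experiments.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:s] = 1.0
    y = X @ beta + noise * rng.standard_normal(n)
    return X, y


X, y = make_sparse_linear(n=2000, p=50, s=3)
scores = stump_scores(X, y)
selected = select_features(X, y, s=3)
print(selected)
```

In this toy setting the sample size is large relative to $s \log p$, so the scores of the active features should sit well above those of the inactive ones. Each feature is scored independently, which is what keeps the computational cost low compared with fitting a joint model.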