Tree-based methods are powerful nonparametric techniques in statistics and machine learning. However, their effectiveness, particularly in finite-sample settings, is not fully understood. Recent applications have revealed a surprising ability to distinguish feature transformations (a task we call symbolic feature selection), an ability that remains unexplained by current theory. This work provides a finite-sample analysis of tree-based methods from a ranking perspective. We link oracle partitions in tree methods to response rankings at local splits, offering new insights into their finite-sample behavior in regression and feature selection tasks. Building on this local ranking perspective, we extend our analysis in two ways: (i) we examine the global ranking performance of individual trees and ensembles, including Classification and Regression Trees (CART) and Bayesian Additive Regression Trees (BART), providing finite-sample oracle bounds, ranking consistency, and posterior contraction results; (ii) inspired by the ranking perspective, we propose concordant divergence statistics $\mathcal{T}_0$ to evaluate symbolic feature mappings and establish their properties. Numerical experiments demonstrate the competitive performance of these statistics in symbolic feature selection tasks compared to existing methods.