Decision trees are widely used for non-linear modeling because they capture interactions between predictors while producing inherently interpretable models. Despite their popularity, inference on the resulting non-linear fit remains largely unaddressed. This paper focuses on classification trees and makes two key contributions. First, we introduce a novel tree-fitting method that replaces the greedy splitting of the predictor space in standard tree algorithms with a probabilistic approach. Each split is selected according to sampling probabilities defined by an exponential mechanism, with a temperature parameter controlling how far the selection deviates from the deterministic, data-driven choice. Second, our approach is not merely predictive: at low temperatures it fits a tree that coincides with high probability with the fit produced by standard tree algorithms, yet, unlike those algorithms, it enables inference that accounts for the highly adaptive tree structure. Our method produces pivots directly from the sampling probabilities in the exponential mechanism. In theory, these pivots allow asymptotically valid inference on the parameters in the predictive fit; in practice, our method delivers powerful inference without sacrificing predictive accuracy, in contrast to data-splitting methods.
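
To make the split-selection step concrete, the following is a minimal sketch of sampling one split with an exponential mechanism. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the score or its scaling, so we assume here that each candidate threshold is scored by its Gini impurity reduction and sampled with probability proportional to exp(gain / temperature). The function names (`gini`, `split_gains`, `split_probabilities`, `sample_split`) and the toy data are hypothetical, and the construction of pivots from the sampling probabilities is not shown.

```python
import math
import random


def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def split_gains(x, y):
    """Impurity reduction for every candidate threshold on one feature.

    Assumption: candidates are midpoints between consecutive distinct values.
    """
    values = sorted(set(x))
    thresholds = [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    parent = gini(y)
    gains = []
    for t in thresholds:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        gains.append(parent - child)
    return thresholds, gains


def split_probabilities(gains, temperature):
    """Probabilities proportional to exp(gain / temperature)."""
    top = max(gains)  # subtract the maximum for numerical stability
    weights = [math.exp((g - top) / temperature) for g in gains]
    total = sum(weights)
    return [w / total for w in weights]


def sample_split(x, y, temperature, rng):
    """Draw one threshold and return it with its sampling probability."""
    thresholds, gains = split_gains(x, y)
    probs = split_probabilities(gains, temperature)
    idx = rng.choices(range(len(thresholds)), weights=probs, k=1)[0]
    return thresholds[idx], probs[idx]


# Toy data (hypothetical): the classes separate cleanly between 0.4 and 0.6.
x = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [0, 0, 0, 0, 1, 1, 1, 1]
rng = random.Random(0)
threshold, prob = sample_split(x, y, temperature=0.01, rng=rng)
```

In this sketch, a small temperature concentrates nearly all probability on the highest-gain split, so the sampled split matches the greedy choice with high probability, while a large temperature makes the candidates nearly equally likely. The sampling probability is returned alongside the threshold because, per the abstract, the inference is built from these probabilities.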