Decision trees are widely used for non-linear modeling, as they capture interactions between predictors while producing inherently interpretable models. Despite their popularity, inference on the resulting non-linear fit remains largely unaddressed. This paper focuses on classification trees and makes two key contributions. First, we introduce a novel tree-fitting method that replaces the greedy splitting of the predictor space in standard tree algorithms with a probabilistic approach: each split is selected according to sampling probabilities defined by an exponential mechanism, with a temperature parameter controlling how far the selection deviates from the deterministic, data-driven choice. Second, although at low temperatures our approach fits, with high probability, the same tree as standard algorithms, it is not merely predictive; unlike standard algorithms, it enables inference that accounts for the highly adaptive tree structure. Our method produces pivots directly from the sampling probabilities of the exponential mechanism. In theory, these pivots yield asymptotically valid inference on the parameters of the predictive fit; in practice, our method delivers powerful inference without sacrificing predictive accuracy, in contrast to data-splitting methods.
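The split-sampling step described above can be illustrated with a minimal sketch. This is not the paper's implementation; the function name `sample_split` and the use of a generic per-split score are assumptions, and the only point made is how an exponential mechanism with a temperature parameter interpolates between random and greedy (deterministic) split selection:

```python
import numpy as np

def sample_split(scores, temperature, rng=None):
    """Sample a split index via an exponential mechanism.

    Split j is drawn with probability proportional to
    exp(scores[j] / temperature).  As temperature -> 0 the draw
    concentrates on the greedy (max-score) split; as temperature
    grows, the draw approaches uniform over candidate splits.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    logits = scores / temperature
    logits -= logits.max()          # stabilize before exponentiating
    probs = np.exp(logits)
    probs /= probs.sum()
    # The returned probabilities are the quantities from which,
    # in the paper's framework, pivots for inference are built.
    return int(rng.choice(len(scores), p=probs)), probs
```

For example, with scores `[0.1, 0.9, 0.5]` and a very low temperature, the sampled index is almost surely the greedy choice (index 1), mimicking a standard tree algorithm, while larger temperatures spread probability mass across the other candidate splits.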