Regression trees are a popular machine learning method that fits piecewise constant models by recursively partitioning the predictor space. In this paper, we focus on performing statistical inference in a data-dependent model obtained from the fitted tree. We introduce Randomized Regression Trees (RRT), a novel selective inference method that adds independent Gaussian noise to the gain function underlying the splitting rules of classic regression trees. The RRT method offers several advantages. First, it utilizes the added randomization to obtain an exact pivot using the full dataset, while accounting for the data-dependent structure of the fitted tree. Second, with a small amount of randomization, the RRT method achieves predictive accuracy similar to that of a model trained on the entire dataset. At the same time, it provides significantly more powerful inference than data splitting methods, which rely only on a held-out portion of the data for inference. Third, unlike data splitting approaches, it yields intervals that adapt to the signal strength in the data. Our empirical analyses highlight these advantages of the RRT method and its ability to convert a purely predictive algorithm into a method capable of performing reliable and powerful inference in the tree model.
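To make the randomization idea concrete, the following is a minimal sketch of a noise-perturbed splitting rule on a single predictor: each candidate split's reduction in squared error is perturbed by independent Gaussian noise before the maximizer is chosen. The function name, the noise scale `sigma`, and the exact form of the gain are illustrative assumptions, not the paper's precise construction.

```python
import numpy as np

def randomized_best_split(x, y, sigma=1.0, rng=None):
    """Choose a split threshold on one predictor by maximizing a
    Gaussian-noise-perturbed reduction in squared error.

    `sigma` controls the amount of added randomization; sigma=0
    recovers the classic greedy split. This is an illustrative
    sketch, not the paper's exact gain function.
    """
    rng = np.random.default_rng() if rng is None else rng
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y_sorted)
    total_ss = ((y_sorted - y_sorted.mean()) ** 2).sum()
    best_gain, best_threshold = -np.inf, None
    for i in range(1, n):  # candidate split between observations i-1 and i
        left, right = y_sorted[:i], y_sorted[i:]
        gain = total_ss - (((left - left.mean()) ** 2).sum()
                           + ((right - right.mean()) ** 2).sum())
        # independent Gaussian noise added to the gain of each candidate
        gain += sigma * rng.standard_normal()
        if gain > best_gain:
            best_gain, best_threshold = gain, 0.5 * (x_sorted[i - 1] + x_sorted[i])
    return best_threshold, best_gain
```

With a small `sigma`, the chosen split is close to the greedy one with high probability, which is one way to see why a lightly randomized tree can retain predictive accuracy while the known noise distribution enables exact selective inference on the full dataset.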