The standard procedure for deciding on the complexity of a CART regression tree is to use cross-validation, with the aim of obtaining a predictor that generalises well to unseen data. The randomness in the selection of folds implies that the selected CART regression tree is not a deterministic function of the data. Moreover, the cross-validation procedure can be time-consuming and make inefficient use of training data. We propose a simple deterministic in-sample method for stopping the growth of a CART regression tree based on node-wise statistical tests. The testing procedure is derived via a connection to change point detection, where the null hypothesis corresponds to no signal. The suggested $p$-value based procedure allows us to consider covariate vectors of arbitrary dimension and to bound the $p$-value of an entire tree from above. Further, we show that the test detects a sufficiently strong signal with high probability, provided the sample size is not too small. We illustrate our methodology and the asymptotic results on both simulated and real-world data. Additionally, we illustrate how the $p$-value based method can be used to construct a deterministic piecewise-constant auto-calibrated predictor from a given black-box predictor.
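To make the idea of a test-based stopping rule concrete, the following is a minimal sketch, not the paper's exact procedure: each node's best split is scored by its reduction in sum of squared errors, a permutation test is used as a stand-in for the change-point-based test described above, and a node is split only when the resulting $p$-value falls below a chosen level. The single-covariate restriction, the function names, and the parameters `alpha`, `min_size`, and `n_perm` are illustrative assumptions.

```python
import numpy as np

def best_split(x, y):
    """Scan all split points of a single covariate and return the split
    (SSE, threshold) minimising the within-node sum of squared errors."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (np.inf, None)
    for i in range(1, len(ys)):
        if xs[i] == xs[i - 1]:
            continue  # identical covariate values cannot be separated
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, (xs[i - 1] + xs[i]) / 2)
    return best

def split_p_value(x, y, n_perm=499, rng=None):
    """Permutation p-value for the node-wise null of 'no signal':
    compare the observed SSE reduction of the best split with the
    reductions obtained after permuting the responses."""
    rng = rng or np.random.default_rng(0)
    sse_root = ((y - y.mean()) ** 2).sum()
    obs_gain = sse_root - best_split(x, y)[0]
    perm_gains = [sse_root - best_split(x, rng.permutation(y))[0] for _ in range(n_perm)]
    return (1 + sum(g >= obs_gain for g in perm_gains)) / (n_perm + 1)

def grow(x, y, alpha=0.05, min_size=20):
    """Recursively grow a regression tree; stop splitting a node as soon
    as its test does not reject the 'no signal' null at level alpha."""
    if len(y) < min_size or split_p_value(x, y) > alpha:
        return {"value": float(y.mean()), "n": len(y)}
    _, threshold = best_split(x, y)
    mask = x <= threshold
    return {
        "split": float(threshold),
        "left": grow(x[mask], y[mask], alpha, min_size),
        "right": grow(x[~mask], y[~mask], alpha, min_size),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, 400)
    y = np.where(x < 0.5, 0.0, 1.0) + rng.normal(0, 0.3, 400)  # one true change point
    print(grow(x, y))
```

Because every split must pass its own node-wise test, the depth of the resulting tree is a deterministic function of the data for a fixed level `alpha`, in contrast to the fold-dependent trees produced by cross-validation.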