Decision trees are powerful machine learning algorithms, widely used in fields such as economics and medicine for their simplicity and interpretability. However, decision trees such as CART are prone to overfitting, especially when the tree is grown deep or the sample size is small. Conventional methods for reducing overfitting include pre-pruning and post-pruning, which constrain the growth of uninformative branches. In this paper, we propose a complementary approach by introducing a covariance-driven splitting criterion for regression trees (CovRT). This criterion is more robust to overfitting than the empirical risk minimization criterion used in CART, as it produces more balanced and stable splits and more effectively identifies covariates carrying true signal. We establish an oracle inequality for CovRT and prove that its predictive accuracy is comparable to that of CART in high-dimensional settings. In both simulations and real-world tasks, CovRT achieves superior prediction accuracy compared to CART.