Gradient Boosting Decision Trees (GBDTs) dominate tabular machine learning, and modern implementations such as XGBoost, LightGBM, and CatBoost are based on Newton boosting: a second-order descent step in the space of decision trees. Despite its empirical success, the global convergence of Newton boosting is poorly understood compared to that of first-order boosting. In this paper, we introduce Restricted Newton Descent, a framework for studying convex optimization with Newton's method on Hilbert spaces with inexact iterates, based on the concepts of cosine angle and weak gradient edge. Within this framework, we recover Newton boosting with GBDTs and classical finite-dimensional theory as special cases. We first prove that vanilla Newton boosting achieves a linear rate of convergence for smooth, strongly convex losses that satisfy a Hessian-dominance condition. To handle general convex losses with Lipschitz Hessians, we extend a recent gradient regularized Newton scheme to the restricted weak learner setting. This scheme minimally modifies the classical algorithm by introducing an adaptive $\ell_2$-regularization term proportional to the square root of the gradient norm at each iteration. We establish an $\mathcal{O}(\frac{1}{k^2})$ rate for this scheme, thereby obtaining a globally convergent second-order GBDT algorithm with a rate matching that of first-order boosting with Nesterov momentum. In numerical experiments, we show that our scheme converges in settings where vanilla Newton boosting may diverge.
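To illustrate the adaptive regularization described in the abstract, the following is a minimal sketch in the exact one-dimensional setting, not the restricted weak learner algorithm of the paper. The test loss $f(x) = \sqrt{1 + x^2}$, the constant `H` (standing in for a quantity tied to the Lipschitz constant of the Hessian), and all function names are assumptions chosen for illustration. On this loss the vanilla Newton update reduces to $x \mapsto -x^3$, so it diverges from any start with $|x| > 1$, while adding $\lambda_k = \sqrt{H\,|f'(x_k)|}$ to the Hessian damps the step.

```python
import math


def grad(x):
    # f(x) = sqrt(1 + x^2), a convex loss with a Lipschitz Hessian
    return x / math.sqrt(1.0 + x * x)


def hess(x):
    return (1.0 + x * x) ** -1.5


def newton(x0, steps, H=None):
    """Run Newton iterations from x0.

    H=None gives the vanilla step x - f'(x) / f''(x).
    Otherwise the Hessian is shifted by lam = sqrt(H * |f'(x)|),
    i.e. a regularizer proportional to the square root of the gradient norm.
    H is a hypothetical tuning constant, not a value taken from the paper.
    """
    x = x0
    for _ in range(steps):
        g, h = grad(x), hess(x)
        lam = 0.0 if H is None else math.sqrt(H * abs(g))
        x -= g / (h + lam)
    return x


# Same starting point, with and without the adaptive regularizer.
vanilla = newton(2.0, steps=5)
regularized = newton(2.0, steps=30, H=1.0)
print("vanilla:", vanilla)
print("regularized:", regularized)
```

Because the regularizer shrinks with the gradient, the damping fades near the minimizer at $x = 0$ and the step approaches the plain Newton step there.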