Models based on recursive adaptive partitioning, such as decision trees and their ensembles, are popular for high-dimensional regression because they can potentially avoid the curse of dimensionality. Because empirical risk minimization (ERM) is computationally infeasible, these models are typically trained using greedy algorithms. Although effective in many cases, these algorithms have been empirically observed to get stuck at local optima. We explore this phenomenon in the context of learning sparse regression functions over $d$ binary features, showing that when the true regression function $f^*$ does not satisfy the Merged Staircase Property (MSP) of Abbe et al. (2022), greedy training requires $\exp(\Omega(d))$ samples to achieve low estimation error. Conversely, when $f^*$ does satisfy MSP, greedy training can attain small estimation error with only $O(\log d)$ samples. This dichotomy mirrors that of two-layer neural networks trained with stochastic gradient descent (SGD) in the mean-field regime, thereby establishing a head-to-head comparison between SGD-trained neural networks and greedy recursive partitioning estimators. Furthermore, ERM-trained recursive partitioning estimators achieve low estimation error with $O(\log d)$ samples irrespective of whether $f^*$ satisfies MSP, thereby demonstrating a statistical-computational trade-off for greedy training. Our proofs are based on a novel interpretation of greedy recursive partitioning using stochastic process theory and a coupling technique that may be of independent interest.
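The contrast between MSP and non-MSP targets can be illustrated with a minimal sketch, which is not taken from the paper. It assumes features uniform on $\{-1,+1\}^d$, a CART-style variance-reduction split criterion, and population-level (infinite-sample) quantities computed by enumeration. The names `split_gains`, `xor`, and `staircase` are hypothetical. The target $x_1 x_2$ does not satisfy MSP, while $x_1 + x_1 x_2$ does.

```python
from itertools import product
from statistics import pvariance


def split_gains(f, d, cell=None):
    """Population variance reduction from splitting on each free coordinate
    inside a cell, assuming features uniform on {-1, +1}^d.

    `cell` maps already-split coordinates to their fixed values.
    """
    cell = cell or {}
    # Enumerate every point of the hypercube that lies in the current cell.
    points = [x for x in product((-1, 1), repeat=d)
              if all(x[j] == v for j, v in cell.items())]
    total = pvariance([f(x) for x in points])
    gains = {}
    for j in range(d):
        if j in cell:
            continue
        left = [f(x) for x in points if x[j] == -1]
        right = [f(x) for x in points if x[j] == 1]
        # Each child holds half of the cell's mass under the uniform law.
        gains[j] = total - 0.5 * pvariance(left) - 0.5 * pvariance(right)
    return gains


def xor(x):
    # Non-MSP target: x_1 * x_2 (0-indexed as x[0] * x[1]).
    return x[0] * x[1]


def staircase(x):
    # MSP target: x_1 + x_1 * x_2.
    return x[0] + x[0] * x[1]


if __name__ == "__main__":
    d = 6
    print("xor, root:", split_gains(xor, d))
    print("staircase, root:", split_gains(staircase, d))
    print("staircase, cell x_1 = +1:", split_gains(staircase, d, {0: 1}))
```

Under these assumptions, every coordinate has zero gain at the root for `xor`, so the criterion cannot tell relevant features from irrelevant ones. For `staircase`, the first coordinate has positive gain at the root, and the second gains signal once the first has been split on. With finite samples the gains are noisy, which is the setting the sample-complexity bounds address.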