Projection pursuit regression (PPR) has played an important role in the development of statistics and machine learning. However, compared with established methods such as random forests (RF) and support vector machines (SVM), PPR has yet to demonstrate a comparable level of accuracy as a statistical learning technique. In this paper, we revisit the estimation of PPR and propose an \textit{optimal} greedy algorithm together with an ensemble approach via "feature bagging", hereafter referred to as ePPR, with the aim of improving its efficacy. Compared to RF, ePPR has two main advantages. First, its consistency can be proved for more general regression functions, requiring only that they be $L^2$ integrable, and higher consistency rates can be achieved. Second, ePPR does not split the samples, so each term of PPR is estimated using the whole data set, which makes the minimization more efficient and guarantees the smoothness of the estimator. Extensive comparisons on real data sets show that ePPR is more efficient in regression and classification than RF and other competitors. Because ePPR is a variant of artificial neural networks (ANN), its efficacy demonstrates that, with suitable statistical tuning, ANN can match or even exceed RF on small to medium-sized data sets. This finding challenges the widespread belief that ANN outperforms RF only when sample sizes are large.
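To make the two ingredients named above concrete, the following is a minimal sketch of greedy PPR fitting combined with feature bagging. PPR approximates the regression function by a sum of ridge functions, $\sum_k g_k(\theta_k^\top x)$. The sketch does not reproduce the \textit{optimal} greedy algorithm proposed in the paper: as stated assumptions, each direction $\theta_k$ is chosen by a crude random search and each $g_k$ is a cubic polynomial. The function names (`fit_ppr`, `fit_eppr`, and so on) and all default parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_ppr(X, y, n_terms=3, n_candidates=100, degree=3, rng=None):
    """Greedy PPR: add one ridge term g(theta^T x) at a time to the residual.

    Assumption: theta is picked by random search and g is a polynomial;
    this is a stand-in, not the paper's optimal greedy step.
    """
    rng = np.random.default_rng() if rng is None else rng
    intercept = float(np.mean(y))
    residual = y - intercept
    terms = []
    for _ in range(n_terms):
        best_sse, best_theta, best_coef = np.inf, None, None
        for _ in range(n_candidates):
            theta = rng.normal(size=X.shape[1])
            theta /= np.linalg.norm(theta)           # unit-length direction
            z = X @ theta                            # projection uses all samples
            coef = np.polyfit(z, residual, degree)   # least-squares ridge function
            sse = float(np.sum((residual - np.polyval(coef, z)) ** 2))
            if sse < best_sse:
                best_sse, best_theta, best_coef = sse, theta, coef
        terms.append((best_theta, best_coef))
        residual = residual - np.polyval(best_coef, X @ best_theta)
    return intercept, terms


def predict_ppr(model, X):
    """Evaluate the fitted sum of ridge functions."""
    intercept, terms = model
    pred = np.full(X.shape[0], intercept)
    for theta, coef in terms:
        pred = pred + np.polyval(coef, X @ theta)
    return pred


def fit_eppr(X, y, n_members=20, n_sub=None, seed=0, **ppr_kwargs):
    """Feature bagging: each member is a PPR fit on a random feature subset.

    Samples are never split or resampled; only the columns vary.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    n_sub = max(1, p // 2) if n_sub is None else n_sub
    members = []
    for _ in range(n_members):
        cols = rng.choice(p, size=n_sub, replace=False)
        members.append((cols, fit_ppr(X[:, cols], y, rng=rng, **ppr_kwargs)))
    return members


def predict_eppr(members, X):
    """Average the predictions of all ensemble members."""
    return np.mean([predict_ppr(m, X[:, cols]) for cols, m in members], axis=0)


# Small synthetic demonstration (the data are simulated, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = np.sin(X[:, 0] + X[:, 1]) + 0.5 * X[:, 2] ** 2 + 0.1 * rng.normal(size=200)
members = fit_eppr(X, y, n_members=10, n_candidates=50, seed=1)
mse = float(np.mean((y - predict_eppr(members, X)) ** 2))
print(f"training MSE: {mse:.3f}, variance of y: {np.var(y):.3f}")
```

Because every ridge term is fitted on all samples and the ensemble varies only the feature subset, the averaged predictor remains a smooth function of the inputs, which is the property the second advantage above refers to.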