We examine the linear regression problem in a challenging high-dimensional setting with correlated predictors, with the aim of explaining and predicting relevant quantities while explicitly allowing the regression coefficient to vary from sparse to dense. Most classical high-dimensional regression estimators require some degree of sparsity. We discuss the more recent concepts of variable screening and random projection as computationally fast dimension reduction tools, and propose a new random projection matrix tailored to the linear regression problem, together with a theoretical bound on the gain in expected prediction error over conventional random projections. Around this new random projection, we build the Sparse Projected Averaged Regression (SPAR) method, which combines probabilistic variable screening steps with random projection steps to obtain an ensemble of small linear models. In contrast to existing methods, we introduce a thresholding parameter to obtain some degree of sparsity. In extensive simulations and two real data applications, we walk through the elements of this method and compare its prediction and variable selection performance to that of various competitors. For prediction, our method performs at least as well as the best competitors in most settings with a high number of truly active variables, while variable selection remains a hard task for all methods in high dimensions.
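The pipeline described above (probabilistic screening, random projection, an ensemble of small linear models, and a final thresholding step) can be illustrated with a minimal sketch. This is not the SPAR implementation: the function name `spar_sketch`, all of its parameters and defaults, and the use of absolute marginal covariances as screening weights are assumptions made for illustration. In particular, the sketch uses a conventional sparse random-sign projection in place of the tailored projection matrix proposed in the paper.

```python
import numpy as np


def spar_sketch(X, y, n_models=20, n_screen=None, m=None, threshold=0.0, seed=0):
    """Illustrative screening + random projection ensemble (hypothetical helper).

    Returns an intercept and a length-p coefficient vector obtained by
    averaging the back-projected coefficients of n_models small linear models.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_screen = n_screen or min(p, 2 * n)   # assumed default: variables kept per model
    m = m or max(1, n // 4)                # assumed default: reduced dimension

    # Center the data so each small model can be fit without an intercept.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean

    # Assumed screening weights: absolute marginal covariance with the response.
    scores = np.abs(Xc.T @ yc) + 1e-12
    probs = scores / scores.sum()

    beta_sum = np.zeros(p)
    for _ in range(n_models):
        # Probabilistic screening: draw variables with probability given by probs.
        keep = rng.choice(p, size=n_screen, replace=False, p=probs)

        # Conventional sparse projection: each kept variable is mapped to one
        # of m reduced coordinates with a random sign.
        rows = rng.integers(0, m, size=n_screen)
        signs = rng.choice([-1.0, 1.0], size=n_screen)
        Phi = np.zeros((m, n_screen))
        Phi[rows, np.arange(n_screen)] = signs

        # Least squares fit in the reduced m-dimensional space.
        Z = Xc[:, keep] @ Phi.T
        gamma, *_ = np.linalg.lstsq(Z, yc, rcond=None)

        # Map the reduced coefficients back to the original variables.
        beta_sum[keep] += Phi.T @ gamma

    # Average over the ensemble; variables never selected contribute zero.
    beta = beta_sum / n_models

    # Thresholding: set small averaged coefficients exactly to zero.
    beta[np.abs(beta) < threshold] = 0.0

    intercept = y_mean - x_mean @ beta
    return intercept, beta
```

Predictions for new data `X_new` would then be `intercept + X_new @ beta`. In this sketch, `threshold=0.0` keeps every averaged coefficient, and larger values give sparser coefficient vectors; how such a parameter should be chosen is not specified here.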