We examine the linear regression problem in a challenging high-dimensional setting with correlated predictors, where the vector of coefficients can vary from sparse to dense. For this setting, we propose combining probabilistic variable screening with random projection tools. More specifically, we introduce a new data-driven random projection tailored to the problem at hand and derive a theoretical bound on the gain in expected prediction error over conventional random projections. The variables that enter the projection are screened in a way that accounts for predictor correlation. To reduce the dependence on fine-tuning choices, we aggregate over an ensemble of linear models. A thresholding parameter is introduced to obtain a higher degree of sparsity. Both this parameter and the number of models in the ensemble can be chosen by cross-validation. In extensive simulations, we compare the proposed method with other random projection tools and with classical sparse and dense methods, and show that it is competitive in terms of prediction across a variety of scenarios with different sparsity and predictor covariance settings. We also show that the method with cross-validation is able to rank the variables satisfactorily. Finally, we showcase the method on two real-data applications.
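To make the pipeline described above more concrete, the following is a minimal sketch of one way to combine probabilistic screening, a data-driven sparse projection, ensemble averaging, and thresholding. It is an illustration under stated assumptions, not the construction analyzed in the paper: the function name `screened_projection_ensemble`, all parameter names and defaults, the ridge-type screening scores, and the one-nonzero-per-column projection weighted by those scores are hypothetical choices made here for the example.

```python
import numpy as np


def screened_projection_ensemble(X, y, n_models=20, n_screen=None, m=None,
                                 threshold=0.0, ridge=1e-2, seed=0):
    """Illustrative sketch only; names and defaults are assumptions."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_screen = n_screen or min(p, 2 * n)   # variables kept per model (assumed default)
    m = m or max(2, n // 4)                # reduced dimension (assumed default)

    # Center the data so the intercept can be recovered at the end.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean

    # Assumed screening score: ridge coefficients computed in the n x n dual
    # form, which takes correlation between predictors into account.
    scores = Xc.T @ np.linalg.solve(Xc @ Xc.T + ridge * n * np.eye(n), yc)
    probs = np.abs(scores) / np.abs(scores).sum()

    beta_sum = np.zeros(p)
    for _ in range(n_models):
        # Probabilistic screening: draw variables with probability
        # proportional to the absolute screening score.
        keep = rng.choice(p, size=n_screen, replace=False, p=probs)

        # Sparse projection: each kept variable is assigned to one random
        # row, and its entry is the screening score (the data-driven part).
        rows = rng.integers(0, m, size=n_screen)
        Phi = np.zeros((m, n_screen))
        Phi[rows, np.arange(n_screen)] = scores[keep]

        # Least squares on the projected predictors, mapped back to the
        # original coordinates.
        Z = Xc[:, keep] @ Phi.T
        gamma, *_ = np.linalg.lstsq(Z, yc, rcond=None)
        beta_sum[keep] += Phi.T @ gamma

    # Aggregate over the ensemble, then threshold small coefficients to zero.
    beta = beta_sum / n_models
    beta[np.abs(beta) < threshold] = 0.0
    intercept = y_mean - x_mean @ beta
    return intercept, beta


# Synthetic example with more predictors than observations.
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 200))
true_beta = np.zeros(200)
true_beta[:5] = 3.0
y = X @ true_beta + 0.5 * rng.standard_normal(60)

intercept, beta = screened_projection_ensemble(X, y)
train_mse = np.mean((y - intercept - X @ beta) ** 2)
```

In practice, `threshold` and `n_models` would be chosen by cross-validation, as the abstract indicates; the sketch fixes them only to stay short. Variables can be ranked by the absolute values of the averaged coefficients in `beta`.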