We address the challenge of correlated predictors in high-dimensional GLMs, where regression coefficients range from sparse to dense, by proposing a data-driven random projection method. This is particularly relevant for applications where the number of predictors is (much) larger than the number of observations and the underlying structure -- whether sparse or dense -- is unknown. We achieve this by using ridge-type estimates for variable screening and random projection, thereby incorporating information about the response-predictor relationship into the dimensionality reduction. We demonstrate that a ridge estimator with a small penalty is effective for random projection and screening, but the penalty value must be chosen carefully. Unlike in linear regression, where penalties approaching zero work well, letting the penalty vanish leads to overfitting in non-Gaussian families. Instead, we recommend a data-driven method for penalty selection. The resulting data-driven random projection improves prediction performance over conventional random projections, even surpassing benchmarks such as the elastic net. Furthermore, an ensemble of multiple such random projections, combined with probabilistic variable screening, delivers the best aggregated results in prediction and variable ranking across varying sparsity levels in simulations, at low computational cost. Finally, three applications with count and binary responses demonstrate the method's advantages in interpretability and prediction accuracy.