Handling high-dimensional datasets presents substantial computational challenges, particularly when the number of features far exceeds the number of observations and when features are highly correlated. Feature screening is a modern approach to mitigating these issues. In this work, the High-dimensional Ordinary Least-squares Projection (HOLP) feature screening method is advanced by employing adaptive ridge regularization. The impact of the ridge penalty on the Ridge-HOLP method is examined, and Air-HOLP is proposed: a data-adaptive extension of Ridge-HOLP in which the ridge-regularization parameter is selected iteratively and optimally for better feature screening performance. The proposed method addresses the challenges of penalty selection in high dimensions by offering a computationally efficient and stable alternative to traditional methods such as bootstrapping and cross-validation. Air-HOLP is evaluated using simulated data and a prostate cancer genetic dataset. The empirical results demonstrate that Air-HOLP improves performance over a wide range of simulation settings. We provide R code that implements the Air-HOLP feature screening method and integrates it into existing feature screening methods that use the HOLP formula.
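For orientation, the Ridge-HOLP estimator referred to above is commonly written as follows. This is a sketch under assumed notation that the abstract itself does not define: X is the n × p design matrix, y is the response vector of length n, r ≥ 0 is the ridge penalty, and d is the number of features retained after screening.

```latex
% Ridge-HOLP coefficient vector for a given penalty r (assumed notation)
\hat{\beta}(r) = X^{\top}\left(X X^{\top} + r I_n\right)^{-1} y

% For r > 0 this coincides with the usual ridge regression estimator
X^{\top}\left(X X^{\top} + r I_n\right)^{-1} y
  = \left(X^{\top} X + r I_p\right)^{-1} X^{\top} y

% Screening step: keep the d features with the largest absolute coefficients
\hat{\mathcal{M}}_d(r) = \left\{\, j : |\hat{\beta}_j(r)| \text{ is among the } d \text{ largest} \,\right\}
```

Setting r = 0 recovers plain HOLP whenever X X^T is invertible. Air-HOLP, as described above, chooses r adaptively from the data; its selection rule is not stated in this abstract and is not reproduced here.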