Building prediction models from mass-spectrometry data is challenging due to the abundance of correlated features with varying degrees of zero-inflation, leading to a common interest in reducing the features to a concise predictor set with good predictive performance. In this study, we formally established and examined regularized regression approaches designed to address zero-inflated and correlated predictors. In particular, we describe a novel two-stage regularized regression approach (ridge-garrote) that explicitly models each zero-inflated predictor using two component variables, applying a ridge estimator in the first stage and a nonnegative garrote estimator in the second stage. We contrasted ridge-garrote with one-stage methods (ridge, lasso) and other two-stage regularized regression approaches (lasso-ridge, ridge-lasso) for zero-inflated predictors. We assessed the predictive performance and predictor selection properties of these methods in a comparative simulation study and a real-data case study to predict kidney function using peptidomic features derived from mass spectrometry. In the simulation study, the predictive performance of all assessed approaches was comparable, yet the ridge-garrote approach selected more parsimonious models than its competitors in most scenarios. While lasso-ridge achieved higher predictive accuracy than its competitors, it exhibited high variability in the number of selected predictors. Ridge-lasso exhibited slightly better predictive accuracy than ridge-garrote, but at the expense of selecting more noise predictors. Overall, ridge emerged as a favourable option when variable selection is not a primary concern, while ridge-garrote demonstrated notable practical utility in selecting a parsimonious set of predictors, with only minimal compromise in predictive accuracy.
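The two-stage idea described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made here for illustration only: the two components of each predictor are coded as a detection indicator and a log-intensity (set to 0 where the feature is absent), the garrote uses one shrinkage factor per original predictor shared by both components, and the tuning parameters are fixed by hand rather than chosen by cross-validation. The function names `two_component_design`, `ridge_fit`, and `nonneg_garrote` are hypothetical.

```python
import numpy as np

def two_component_design(X):
    """Code each zero-inflated column as [indicator, log-intensity] (assumed coding)."""
    D = (X > 0).astype(float)               # binary component: feature detected or not
    U = np.log(np.where(X > 0, X, 1.0))     # continuous component: log value, 0 if absent
    return np.hstack([D, U])                # columns: D_1..D_p, U_1..U_p

def ridge_fit(Z, y, lam):
    """Stage 1: ridge coefficients on centred data (closed form)."""
    Zc, yc = Z - Z.mean(axis=0), y - y.mean()
    return np.linalg.solve(Zc.T @ Zc + lam * np.eye(Z.shape[1]), Zc.T @ yc)

def nonneg_garrote(Z, y, beta, lam, n_iter=200):
    """Stage 2: minimise ||y - W c||^2 + lam * sum(c) subject to c >= 0.

    Column j of W is the stage-1 contribution of predictor j (both components),
    so c[j] = 0 removes predictor j entirely (assumed grouping).
    """
    p = Z.shape[1] // 2
    Zc, yc = Z - Z.mean(axis=0), y - y.mean()
    W = Zc[:, :p] * beta[:p] + Zc[:, p:] * beta[p:]
    norms = (W ** 2).sum(axis=0)
    c = np.ones(p)
    r = yc - W @ c                           # current residual
    for _ in range(n_iter):                  # cyclic coordinate descent
        for j in range(p):
            if norms[j] == 0.0:
                c[j] = 0.0
                continue
            r += W[:, j] * c[j]              # partial residual without predictor j
            c[j] = max(0.0, (W[:, j] @ r - lam / 2.0) / norms[j])
            r -= W[:, j] * c[j]
    return c

# Simulated example: 10 zero-inflated predictors, only the first carries signal.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.lognormal(size=(n, p)) * (rng.random((n, p)) > 0.4)
Z = two_component_design(X)
y = 2.0 * Z[:, 0] + 1.0 * Z[:, p] + rng.normal(scale=0.5, size=n)

beta = ridge_fit(Z, y, lam=1.0)              # stage 1: initial ridge estimates
c = nonneg_garrote(Z, y, beta, lam=20.0)     # stage 2: nonnegative shrinkage factors
coef = np.tile(c, 2) * beta                  # final coefficients; zero where c == 0
print("selected predictors:", np.flatnonzero(c > 0))
```

In this sketch, a predictor is dropped when its shrinkage factor reaches zero, which is how the second stage would yield a more parsimonious model than ridge alone. A larger garrote penalty removes more predictors.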