Ridge regression with random coefficients provides an important alternative to fixed-coefficient regression in high-dimensional settings when the effects are expected to be small but nonzero. This paper considers estimation and prediction for random coefficient ridge regression in the transfer learning setting, where, in addition to observations from the target model, source samples from different but possibly related regression models are available. The informativeness of a source model for the target model can be quantified by the correlation between their regression coefficients. This paper proposes two estimators of the regression coefficients of the target model, each a weighted sum of the ridge estimates from the target and source models, where the weights are determined by minimizing the empirical estimation risk or prediction risk. Using random matrix theory, the limiting values of the optimal weights are derived in the regime where $p/n \rightarrow \gamma$, with $p$ the number of predictors and $n$ the sample size, which leads to explicit expressions for the estimation and prediction risks. Simulations show that these limiting risks agree closely with the empirical risks. An application to predicting polygenic risk scores for lipid traits shows that such transfer learning methods lead to smaller prediction errors than single-sample ridge regression or Lasso-based transfer learning.
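The weighted-sum construction described above can be illustrated with a minimal sketch. It is not the paper's procedure: it assumes a Gaussian design, a single source model, and a common ridge penalty, and it picks the weights by least squares on held-out target data as a simple stand-in for minimizing the empirical prediction risk. All function and parameter names (`ridge`, `simulate`, `fit_weights`, `rho`, `lam`) are hypothetical.

```python
import numpy as np


def ridge(X, y, lam):
    """Ridge estimate (X'X + n*lam*I)^{-1} X'y (one common scaling; an assumption)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)


def simulate(n_target=200, n_source=400, p=300, rho=0.8, seed=0):
    """Draw target and source data whose random coefficients have correlation rho."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(p)
    z2 = rng.standard_normal(p)
    scale = 1.0 / np.sqrt(p)  # small but nonzero effects
    beta_t = scale * z1
    beta_s = scale * (rho * z1 + np.sqrt(1.0 - rho**2) * z2)
    X_t = rng.standard_normal((n_target, p))
    X_s = rng.standard_normal((n_source, p))
    y_t = X_t @ beta_t + rng.standard_normal(n_target)
    y_s = X_s @ beta_s + rng.standard_normal(n_source)
    return X_t, y_t, X_s, y_s, beta_t


def fit_weights(X_val, y_val, b_t, b_s):
    """Weights (w_t, w_s) minimizing held-out squared prediction error.

    Hold-out least squares is an illustrative substitute for the
    risk-minimizing weights described in the abstract.
    """
    preds = np.column_stack([X_val @ b_t, X_val @ b_s])
    w, *_ = np.linalg.lstsq(preds, y_val, rcond=None)
    return w


def demo(lam=1.0, seed=0):
    """Fit both ridge estimates, combine them, and return squared estimation errors."""
    X_t, y_t, X_s, y_s, beta_t = simulate(seed=seed)
    half = len(y_t) // 2
    b_t = ridge(X_t[:half], y_t[:half], lam)   # target-only ridge estimate
    b_s = ridge(X_s, y_s, lam)                 # source ridge estimate
    w = fit_weights(X_t[half:], y_t[half:], b_t, b_s)
    b_tl = w[0] * b_t + w[1] * b_s             # weighted-sum estimator
    err_target_only = float(np.sum((b_t - beta_t) ** 2))
    err_weighted = float(np.sum((b_tl - beta_t) ** 2))
    return w, err_target_only, err_weighted
```

Calling `demo()` returns the fitted weights together with the squared estimation errors of the target-only and weighted estimators, so the two can be compared for different values of `rho`. How much the source helps in any given run depends on the correlation, the sample sizes, and the penalty.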