Developing comprehensive prediction models is of great interest in many scientific disciplines, but datasets with information on all desired features typically have small sample sizes. In this article, we describe a transfer learning approach for building high-dimensional generalized linear models using data from a main study that has detailed information on all predictors and from one or more external studies that have ascertained a more limited set of predictors. We propose using the external dataset(s) to build reduced model(s) and then transferring the information on the underlying parameters to the analysis of the main study through a set of calibration equations, while accounting for the study-specific effects of certain design variables. We then use a penalized generalized method of moments (GMM) for parameter estimation and develop highly scalable model-fitting algorithms that take advantage of the popular glmnet package. We further show that the adaptive-Lasso penalty yields the oracle property for the underlying parameter estimates and thus leads to convenient post-selection inference procedures. We conduct extensive simulation studies to investigate both the predictive performance and the post-selection inference properties of the proposed method. Finally, we illustrate a timely application of the proposed method to the development of risk prediction models for five common diseases using the UK Biobank study, combining baseline information from all study participants (500K) and recently released high-throughput proteomic data (1,500 proteins) on a subset (50K) of the participants.
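To make the calibration idea concrete, the following is a minimal sketch rather than the authors' algorithm. It assumes a linear model with two predictors and no intercept, where `x1` is measured in both studies and `x2` only in the main study, and where `theta` is the reduced-model coefficient of `y` on `x1` taken from the external study. It stacks the main-study score equations with one calibration equation and solves the resulting GMM problem with an identity weight matrix and no penalty. The function name `gmm_calibrated_fit` and all variable names are hypothetical; the penalized, glmnet-based fitting, the adaptive-Lasso step, and the study-specific design effects described in the abstract are omitted.

```python
def gmm_calibrated_fit(x1, x2, y, theta):
    """Identity-weighted, unpenalized GMM for y ~ b1*x1 + b2*x2.

    Assumption (illustrative only): theta is the external reduced-model
    coefficient from regressing y on x1 alone, with no intercept.
    """
    n = len(y)
    # Sample second moments computed from the main study.
    s11 = sum(a * a for a in x1) / n
    s12 = sum(a * b for a, b in zip(x1, x2)) / n
    s22 = sum(b * b for b in x2) / n
    s1y = sum(a * c for a, c in zip(x1, y)) / n
    s2y = sum(b * c for b, c in zip(x2, y)) / n

    # Stacked moment vector g(beta) = a - B * beta, where
    #   rows 1-2: main-study score, mean of x * (y - x'beta)
    #   row 3:    calibration equation, mean of x1 * (x1*theta - x'beta)
    a = [s1y, s2y, s11 * theta]
    B = [[s11, s12],
         [s12, s22],
         [s11, s12]]

    # Minimizing ||a - B * beta||^2 gives the normal equations
    # (B'B) beta = B'a, a 2x2 system solved here by Cramer's rule.
    m11 = sum(r[0] * r[0] for r in B)
    m12 = sum(r[0] * r[1] for r in B)
    m22 = sum(r[1] * r[1] for r in B)
    v1 = sum(r[0] * ai for r, ai in zip(B, a))
    v2 = sum(r[1] * ai for r, ai in zip(B, a))
    det = m11 * m22 - m12 * m12
    return ((m22 * v1 - m12 * v2) / det, (m11 * v2 - m12 * v1) / det)


# Toy, noise-free main-study data generated with b1 = 2 and b2 = 3.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, -1.0, 1.0, -1.0]
y = [2.0 * a + 3.0 * b for a, b in zip(x1, x2)]
# Stand-in for the external estimate: the reduced-model fit of y on x1.
theta = sum(a * c for a, c in zip(x1, y)) / sum(a * a for a in x1)
b1, b2 = gmm_calibrated_fit(x1, x2, y, theta)
```

Because the stacked moments are linear in the coefficients, the criterion is an ordinary least-squares problem with pseudo-response `a` and pseudo-design `B`. Adding an L1 penalty to that problem turns it into a Lasso fit, which is one plausible reading of how a Lasso solver such as glmnet could be reused; the actual construction in the paper may differ.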