Covariate shift and outcome model heterogeneity are two prominent challenges in leveraging external data sources to improve risk modeling for underrepresented cohorts that lack accurate labels. We consider a transfer learning problem in which an unlabeled minority target sample faces (i) covariate shift relative to a labeled source sample collected on a different cohort; and (ii) outcome model heterogeneity relative to a majority sample that is informative about the target minority model. For this scenario, we develop a novel model-assisted and knowledge-guided transfer learning approach targeting underrepresented populations (MAKEUP) for high-dimensional regression models. MAKEUP combines a model-assisted debiasing step that addresses the covariate shift with a knowledge-guided sparsifying procedure that leverages the majority data to enhance learning on the minority group. We also develop a model selection method that avoids negative knowledge transfer and works without gold-standard labels on the target sample. Theoretical analyses show that MAKEUP provides efficient estimation of the target model on the minority group. It is robust to high complexity and misspecification of the nuisance models used for covariate shift correction, and it adapts to model heterogeneity and potential negative transfer between the majority and minority groups. Numerical studies demonstrate similar advantages over existing approaches in finite samples. We also illustrate our approach with a real-world application: transfer learning of type II diabetes genetic risk models for an underrepresented ancestry group.