In this work, we propose a Semi-supervised Triply Robust Inductive transFer LEarning (STRIFLE) approach that integrates heterogeneous data from a label-rich source population and a label-scarce target population while simultaneously exploiting a large amount of unlabeled data to improve learning accuracy in the target population. Specifically, we consider a high-dimensional covariate shift setting and employ two nuisance models, a density ratio model and an imputation model, to effectively combine transfer learning with surrogate-assisted semi-supervised learning and achieve triple robustness. Although STRIFLE assumes that the target and source populations share the same conditional distribution of the outcome Y given both the surrogate features S and the predictors X, it allows the true underlying model of Y|X to differ between the two populations because of potential covariate shift in S and X. Unlike a doubly robust estimator, the triply robust STRIFLE estimator can still partially use the source population even if both nuisance models are misspecified or the distribution of Y|(S, X) differs between the two populations, provided that the shifted source population and the target population share enough similarities. Moreover, it is guaranteed to be no worse than the target-only surrogate-assisted semi-supervised estimator, up to an additional error term from transferability detection. These desirable properties are established theoretically and verified in finite samples via extensive simulation studies. We use the STRIFLE estimator to train a Type II diabetes polygenic risk prediction model for an African American target population by transferring knowledge from electronic health record-linked genomic data observed in a larger European source population.
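To make the role of the two nuisance models concrete, the following is a minimal sketch of a generic doubly robust estimator of a target-population mean under covariate shift. It is not the STRIFLE estimator: it omits the surrogate features S, the high-dimensional regularization, and the transferability detection step, and it assumes a low-dimensional X. The function names (`fit_logistic`, `doubly_robust_target_mean`), the log-linear density ratio model, and the linear imputation model are all illustrative assumptions rather than choices taken from the paper.

```python
import numpy as np


def fit_logistic(Z, d, n_iter=25, ridge=1e-6):
    """Newton-Raphson logistic regression of labels d on design matrix Z.

    Z is assumed to already contain an intercept column. The tiny ridge
    term only keeps the Hessian invertible; it is not a tuned penalty.
    """
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        grad = Z.T @ (d - p) - ridge * beta
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z + ridge * np.eye(Z.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def doubly_robust_target_mean(X_src, Y_src, X_tgt):
    """Estimate the target mean of Y from labeled source and unlabeled target data.

    Hypothetical illustration: combines an imputation model m(x) with a
    density ratio w(x) = p_target(x) / p_source(x). The estimate is
    consistent if either of the two models is correctly specified.
    """
    n_s, n_t = len(X_src), len(X_tgt)
    Zs = np.column_stack([np.ones(n_s), X_src])
    Zt = np.column_stack([np.ones(n_t), X_tgt])

    # Density ratio model: classify target (1) versus source (0); the fitted
    # odds, rescaled by the sample-size ratio, estimate w(x) = exp(x' beta).
    membership = np.concatenate([np.zeros(n_s), np.ones(n_t)])
    beta = fit_logistic(np.vstack([Zs, Zt]), membership)
    w = (n_s / n_t) * np.exp(Zs @ beta)
    w = w / w.mean()  # normalize weights for numerical stability

    # Imputation model: ordinary least squares fitted on the labeled source data.
    gamma, *_ = np.linalg.lstsq(Zs, Y_src, rcond=None)
    m_src, m_tgt = Zs @ gamma, Zt @ gamma

    # Imputed target mean plus a density-ratio-weighted residual correction.
    return m_tgt.mean() + np.mean(w * (Y_src - m_src))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_src = rng.normal(0.0, 1.0, size=20000)   # source covariates
    X_tgt = rng.normal(1.0, 1.0, size=20000)   # shifted target covariates
    # Quadratic outcome, so the linear imputation model is misspecified,
    # while the log-linear density ratio model is correct for this shift.
    Y_src = 1.0 + 2.0 * X_src + X_src**2 + rng.normal(size=20000)
    print("naive source mean:", Y_src.mean())
    print("doubly robust estimate:", doubly_robust_target_mean(X_src, Y_src, X_tgt))
```

In this simulated setup the true target mean is 5 and the source mean is 2, so the naive source average is badly biased, whereas the weighted residual correction compensates for the misspecified imputation model. The triple robustness described in the abstract goes one step further by covering the case where both nuisance models fail, which this sketch does not attempt.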