Efficient Modeling of Surrogates to Improve Multi-source High-dimensional Biobank Studies

Surrogate variables in electronic health records (EHR) and biobank data play an important role in biomedical studies because chart-reviewed gold-standard labels are scarce or absent. We develop a novel approach named SASH, for Surrogate-Assisted and data-Shielding High-dimensional integrative regression. It is a semi-supervised approach that efficiently leverages sizable unlabeled samples with error-prone EHR surrogate outcomes from multiple local sites to improve the learning accuracy of the small gold-labeled data set. To facilitate stable and efficient knowledge extraction from the surrogates, our method first obtains a preliminary supervised estimator and then uses it to assist in training a regularized single index model (SIM) for the surrogates. Interestingly, through a chain of convex and properly penalized sparse regressions that approximate the SIM loss with bias correction, our method avoids the local-minima issue of SIM training and fully eliminates the impact of the preliminary estimator's large error. In addition, it protects individual-level information through summary-statistics-based data aggregation across the local sites, using a similar bias-corrected approximation of the SIM. Through simulation studies, we demonstrate that our method outperforms existing approaches on finite samples. Finally, we apply our method to develop a high-dimensional genetic risk model for type II diabetes using large-scale data sets from the UK and Mass General Brigham biobanks, where only a small fraction of subjects at one site have been labeled via chart review.
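
To make the idea of summary-statistics-based aggregation concrete, the following is a minimal sketch in Python and not the SASH procedure itself. It assumes a plain linear model fitted with an ordinary lasso, so it contains no single index model, no surrogates, and no bias correction. The function names `site_summary` and `pooled_lasso`, the simulated data, and the penalty level are all hypothetical choices made for illustration. Each site shares only its Gram matrix, its cross-product vector, and its sample size, and the central fit never touches individual-level rows.

```python
import numpy as np

def site_summary(X, y):
    """Summary statistics a site would share: X'X, X'y, and its sample size."""
    return X.T @ X, X.T @ y, X.shape[0]

def pooled_lasso(summaries, lam, n_iter=200):
    """Coordinate-descent lasso computed from pooled summary statistics only.

    Minimizes (1/2n)||y - X b||^2 + lam * ||b||_1, where X and y are the
    stacked data from all sites, which enter only through X'X and X'y.
    """
    n = sum(s[2] for s in summaries)
    G = sum(s[0] for s in summaries) / n   # pooled Gram matrix X'X / n
    r = sum(s[1] for s in summaries) / n   # pooled cross-product X'y / n
    beta = np.zeros(G.shape[0])
    for _ in range(n_iter):
        for j in range(len(beta)):
            # Correlation of feature j with the residual that excludes feature j
            rho = r[j] - G[j] @ beta + G[j, j] * beta[j]
            # Soft-thresholding update for coordinate j
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / G[j, j]
    return beta

# Hypothetical two-site example with simulated data (illustrative assumption).
rng = np.random.default_rng(0)
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
sites = []
for n_site in (200, 300):
    X = rng.normal(size=(n_site, 5))
    y = X @ beta_true + 0.1 * rng.normal(size=n_site)
    sites.append((X, y))

summaries = [site_summary(X, y) for X, y in sites]
beta_hat = pooled_lasso(summaries, lam=0.05)
print(beta_hat)
```

Because the squared-error loss depends on the data only through the Gram matrix and the cross-product vector, the pooled fit matches the fit on the stacked individual-level data. The abstract indicates that SASH relies on a bias-corrected approximation to obtain a comparable summary-statistics form for the SIM loss, which this sketch does not attempt.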

