In biomedical studies, it is often desirable to characterize how multiple disease outcomes interact, beyond their marginal risks. The Ising model is one of the most popular choices for this purpose. However, the learning efficiency of Ising models can be impeded by the scarcity of accurate disease labels, a prominent problem in contemporary studies driven by electronic health records (EHR). Semi-supervised learning (SSL), which leverages a large unlabeled sample with auxiliary EHR features to improve on learning from the labeled data alone, is a potential solution to this issue. In this paper, we develop a novel SSL method for efficient inference on the Ising model. Our method first models the outcomes against the auxiliary features, then uses this conditional model to project the score function of the supervised estimator onto the EHR features, and finally incorporates the unlabeled sample to augment the supervised estimator, reducing variance without introducing bias. For the key step of conditional modeling, we propose strategies that effectively leverage the auxiliary EHR information while maintaining moderate model complexity. In addition, we introduce approaches, including intrinsic efficient updates and ensembling, to overcome potential misspecification of the conditional model, which may cause efficiency loss. Our method is justified by asymptotic theory and is shown to outperform existing SSL methods in simulation studies. We also illustrate its utility in a real example involving several key phenotypes related to frequent ICU admission, using the MIMIC-III data set.
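To make the augmentation step concrete, the following is a minimal toy sketch and not the paper's estimator. Instead of the full Ising score function, it targets a single pairwise moment, E[Y1·Y2], which is one of the sufficient statistics of an Ising model. The data-generating process, the linear working model, and all function names (`simulate`, `fit_line`, `supervised_and_ssl`) are illustrative assumptions. The idea shown is the generic one described above: fit a working model of the outcome term given the features on the labeled data, then correct the supervised estimate by the difference between the working model's average over the labeled sample and its average over all samples.

```python
import random
from statistics import fmean


def simulate(n_total, rng):
    """Toy data (assumed): one surrogate feature x, two correlated binary outcomes."""
    data = []
    for _ in range(n_total):
        x = rng.gauss(0.0, 1.0)
        y1 = 1 if x + rng.gauss(0.0, 0.5) > 0 else 0
        y2 = 1 if x + rng.gauss(0.0, 0.5) > 0 else 0
        data.append((x, y1, y2))
    return data


def fit_line(xs, zs):
    """Least-squares fit z ~ a + b*x, used here as a simple working conditional model."""
    xbar, zbar = fmean(xs), fmean(zs)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxz = sum((x - xbar) * (z - zbar) for x, z in zip(xs, zs))
    b = sxz / sxx
    a = zbar - b * xbar
    return a, b


def supervised_and_ssl(data, n_labeled):
    """Estimate E[Y1*Y2] from labels only, then augment with unlabeled features."""
    labeled = data[:n_labeled]            # outcomes are treated as observed only here
    x_lab = [x for x, _, _ in labeled]
    z_lab = [y1 * y2 for _, y1, y2 in labeled]
    x_all = [x for x, _, _ in data]       # features are observed for every record

    # Supervised estimator: plain average over the labeled sample.
    theta_sup = fmean(z_lab)

    # Working model of the outcome term given the feature, fit on labeled data.
    a, b = fit_line(x_lab, z_lab)
    m_lab = fmean([a + b * x for x in x_lab])
    m_all = fmean([a + b * x for x in x_all])

    # Augmentation: subtract the labeled-vs-all discrepancy of the working model.
    # The correction has mean close to zero, so it targets variance, not the estimand.
    theta_ssl = theta_sup - (m_lab - m_all)
    return theta_sup, theta_ssl


if __name__ == "__main__":
    rng = random.Random(0)
    sup, ssl = supervised_and_ssl(simulate(1100, rng), n_labeled=100)
    print("supervised:", sup, "semi-supervised:", ssl)
```

In this toy setting the working model is deliberately crude (a straight line for a binary product), which loosely mirrors the abstract's point that a misspecified conditional model can cost efficiency but need not introduce bias. Repeating the simulation over many seeds and comparing the spread of the two estimates is a simple way to see the variance reduction.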