Robust and Efficient Semi-supervised Learning for Ising Model

In biomedical studies, it is often desirable to characterize how multiple disease outcomes interact, beyond their marginal risks. The Ising model is one of the most popular choices for this purpose. Nevertheless, the learning efficiency of Ising models can be impeded by the scarcity of accurate disease labels, a prominent problem in contemporary studies driven by electronic health records (EHR). Semi-supervised learning (SSL) leverages a large unlabeled sample with auxiliary EHR features to improve on learning from the labeled data alone, and is a potential solution to this issue. In this paper, we develop a novel SSL method for efficient inference of the Ising model. Our method first models the outcomes against the auxiliary features, then uses this model to project the score function of the supervised estimator onto the EHR features, and finally incorporates the unlabeled sample to augment the supervised estimator, reducing variance without introducing bias. For the key step of conditional modeling, we propose strategies that effectively leverage the auxiliary EHR information while maintaining moderate model complexity. In addition, we introduce approaches, including intrinsic efficient updates and ensembling, to overcome potential misspecification of the conditional model, which may cause efficiency loss. Our method is justified by asymptotic theory and shown to outperform existing SSL methods in simulation studies. We also illustrate its utility in a real example concerning several key phenotypes related to frequent ICU admission, using the MIMIC-III data set.
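The three steps described in the abstract can be sketched in generic notation. This is an illustrative reading of the abstract and not the paper's own formulation: the 0/1 coding of the outcomes, the symbols $Y$, $X$, $\theta$, $S$, $\hat m$, $\hat H$, and the sample sizes $n$ (labeled) and $N$ (unlabeled) are all assumptions introduced here.

```latex
% Ising model for q binary outcomes Y = (Y_1, ..., Y_q) in {0,1}^q (assumed coding)
P_\theta(Y = y) \;\propto\; \exp\Big( \sum_{j=1}^{q} \theta_{jj}\, y_j
      + \sum_{1 \le j < k \le q} \theta_{jk}\, y_j y_k \Big)

% Supervised estimator: solves the score equation on the n labeled observations
\frac{1}{n} \sum_{i=1}^{n} S(Y_i; \hat\theta_{\mathrm{SL}}) = 0

% Step 1: fit a working model for the conditional mean of the score given the EHR features X
\hat m(x) \;\approx\; E\big[\, S(Y; \theta) \mid X = x \,\big]

% Steps 2-3: augment the supervised estimator using all n + N feature vectors,
% with \hat H an estimate of -E[\partial S(Y;\theta) / \partial\theta^{\top}]
\hat\theta_{\mathrm{SSL}} \;=\; \hat\theta_{\mathrm{SL}}
   \;-\; \hat H^{-1} \Big\{ \frac{1}{n} \sum_{i=1}^{n} \hat m(X_i)
   \;-\; \frac{1}{n+N} \sum_{i=1}^{n+N} \hat m(X_i) \Big\}
```

Under this sketch, and assuming the labeled observations are a random subsample, the term in braces has mean zero for any fixed choice of $\hat m$, which is why the augmentation does not introduce bias even when the conditional model is misspecified. The variance reduction is largest when $\hat m$ is close to the true conditional mean of the score, which is consistent with the abstract's emphasis on guarding against efficiency loss from misspecification.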

