Electronic health records (EHRs) pose large-scale multi-disease modeling problems in which many outcomes are rare and strongly influenced by shared risk factors. While modern approaches achieve strong predictive performance, they often treat diseases independently or rely on black-box architectures, offering limited insight into how risk factors organize disease risk and little principled uncertainty quantification. We introduce a Bayesian hypergraph inference framework that reframes multi-disease modeling around latent, risk-factor-modulated disease pathways. Risk factors act on hyperedges, that is, latent subsets of diseases with shared risk patterns. This lets a disease participate in multiple distinct pathways and yields interpretable, higher-order structure beyond pairwise associations. A repulsion prior encourages parsimonious and identifiable structure, while posterior inference provides calibrated uncertainty over both disease groupings and risk-factor influence. To enable scalable inference on large EHR datasets, we develop a structured variational inference algorithm that preserves logical dependencies among hyperedge existence, disease membership, and pathway-level effects. Experiments on simulated data and UK Biobank data demonstrate stable and interpretable disease pathway structure, well-calibrated uncertainty, improved estimation for rare diseases, and competitive predictive performance.