Rare diseases affect an estimated 300–400 million people worldwide, yet individual conditions remain underdiagnosed and poorly characterized due to their low prevalence and limited clinician familiarity. Computational phenotyping offers a scalable approach to improving rare disease detection, but algorithm development is hindered by the scarcity of high-quality labeled data for training. Expert-labeled datasets from chart reviews and registries are clinically accurate but limited in scope and availability, whereas labels derived from electronic health records (EHRs) provide broader coverage but are often noisy or incomplete. To address these challenges, we propose WEST (WEakly Supervised Transformer for rare disease phenotyping and subphenotyping from EHRs), a framework that combines routinely collected EHR data with a limited set of expert-validated cases and controls to enable large-scale phenotyping. At its core, WEST employs a weakly supervised transformer model trained on extensive probabilistic silver-standard labels, derived from both structured and unstructured EHR features, that are iteratively refined during training to improve model calibration. We evaluate WEST on two rare pulmonary diseases using EHR data from Boston Children's Hospital and show that it outperforms existing methods in phenotype classification, identification of clinically meaningful subphenotypes, and prediction of disease progression. By reducing reliance on manual annotation, WEST enables data-efficient rare disease phenotyping that improves cohort definition, supports earlier and more accurate diagnosis, and accelerates data-driven discovery for the rare disease community.
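The idea of training on probabilistic silver-standard labels that are refined over successive rounds, anchored by a small expert-validated set, can be illustrated with a toy sketch. This is not the WEST model: a plain logistic regression stands in for the transformer, the data are synthetic, and the function names (`fit_soft_logistic`, `refine_labels`) and the mixing rule (`mix`) are assumptions made purely for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_soft_logistic(X, y_soft, n_steps=500, lr=0.5):
    """Fit logistic regression to soft targets in [0, 1] by gradient descent
    on the cross-entropy loss. Stand-in for a real model (assumption)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)
        grad = p - y_soft              # gradient of cross-entropy w.r.t. logits
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def refine_labels(X, silver, gold_idx, gold_y, n_rounds=3, mix=0.5):
    """Alternate between fitting on the current soft labels and blending
    those labels with the model's predictions. Expert-validated (gold)
    labels are reset after every round so they are never overwritten."""
    labels = np.asarray(silver, dtype=float).copy()
    labels[gold_idx] = gold_y
    for _ in range(n_rounds):
        w, b = fit_soft_logistic(X, labels)
        pred = sigmoid(X @ w + b)
        labels = (1.0 - mix) * labels + mix * pred   # hypothetical update rule
        labels[gold_idx] = gold_y
    return labels, (w, b)


# Synthetic demo: two features shifted by disease status, noisy silver labels.
rng = np.random.default_rng(0)
n = 400
y_true = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2)) + 2.0 * y_true[:, None]
silver = np.clip(0.5 + 0.3 * (2 * y_true - 1) + rng.normal(scale=0.3, size=n),
                 0.05, 0.95)
gold_idx = np.arange(20)               # pretend the first 20 were chart-reviewed
refined, _ = refine_labels(X, silver, gold_idx, y_true[gold_idx])

print("silver accuracy :", np.mean((silver > 0.5) == y_true))
print("refined accuracy:", np.mean((refined > 0.5) == y_true))
```

The blending step keeps each refined label a convex combination of the previous label and the model output, so labels stay in [0, 1]; how the actual framework refines its labels is not specified in the abstract.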