We propose a spectral-based, unsupervised representation learning framework to derive low-dimensional embeddings for clinical concepts and patients in rare disease cohorts from electronic health records, where data are high-dimensional but sample sizes are limited. To overcome this challenge, we incorporate a knowledge matrix extracted from a broader population that shares a partially overlapping subspace with the rare-disease cohort. Our method departs from existing approaches by relaxing restrictive one-to-one signal-alignment assumptions between the latent data matrix and knowledge matrix, allowing more flexible and realistic forms of structured sharing. We introduce a novel two-step spectral embedding procedure: first, we identify and remove irrelevant components from the knowledge matrix; then, we apply a projection-based method to separately recover shared and heterogeneous components. Simulations and an analysis of a real-world multiple sclerosis cohort show that the proposed method outperforms competing approaches, particularly in challenging scenarios where shared signals are weak and only partially aligned, as is common in rare-disease data.
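The two-step idea described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under strong assumptions, not the proposed estimator: it assumes both the cohort matrix `X` and the knowledge matrix `K` are symmetric (for example, concept co-occurrence summaries), and the function name `two_step_spectral_sketch`, the energy-based filter, and the threshold `tau` are all hypothetical choices made for illustration.

```python
import numpy as np


def two_step_spectral_sketch(X, K, r_know, r_het, tau):
    """Hypothetical illustration of a filter-then-project embedding.

    X      : (p, p) symmetric matrix from the small rare-disease cohort.
    K      : (p, p) symmetric knowledge matrix from the broader population.
    r_know : number of leading knowledge directions to examine (assumed known).
    r_het  : rank of the cohort-specific part (assumed known).
    tau    : assumed threshold below which a knowledge direction is dropped.
    """
    p = X.shape[0]

    # Step 1: take the leading eigenvectors of K, then discard directions
    # that carry little energy in X (treated here as "irrelevant").
    evals, evecs = np.linalg.eigh(K)
    lead = np.argsort(np.abs(evals))[::-1][:r_know]
    U = evecs[:, lead]
    energy = np.abs(np.diag(U.T @ X @ U))  # |u_j^T X u_j| for each direction
    U_shared = U[:, energy > tau]

    # Step 2: project X onto the retained subspace (shared part) and onto
    # its orthogonal complement (heterogeneous part), treated separately.
    P = U_shared @ U_shared.T
    Q = np.eye(p) - P
    h_vals, h_vecs = np.linalg.eigh(Q @ X @ Q)
    top = np.argsort(np.abs(h_vals))[::-1][:r_het]
    V_het = h_vecs[:, top]

    # Columns of U_shared and V_het together span the estimated embedding space.
    return U_shared, V_het
```

In this sketch, a knowledge direction survives step 1 only if the cohort matrix has non-negligible energy along it, so `K` may contain components with no counterpart in `X`. How the filter and the ranks should actually be chosen is left open here.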