AI-enabled precision medicine promises a transformational improvement in healthcare outcomes by enabling data-driven, personalized diagnosis, prognosis, and treatment. However, the well-known "curse of dimensionality" and the clustered structure of biomedical data interact to present a joint challenge in the high-dimensional, limited-observation regime typical of precision medicine. To overcome both issues simultaneously, we propose a simple and scalable approach to joint clustering and embedding that combines standard embedding methods with a convex clustering penalty in a modular way. This novel, cluster-aware embedding approach overcomes the complexity and limitations of current joint embedding and clustering methods, which we show with straightforward implementations of hierarchically clustered principal component analysis (PCA), locally linear embedding (LLE), and canonical correlation analysis (CCA). Through both numerical experiments and real-world examples, we demonstrate that our approach outperforms traditional and contemporary clustering methods on highly underdetermined problems (e.g., with just tens of observations) as well as on large-sample datasets. Importantly, our approach does not require the user to choose the number of clusters in advance; instead, it yields interpretable dendrograms of hierarchically clustered embeddings. Our approach thus improves significantly on existing methods for identifying patient subgroups in multiomics and neuroimaging data, enabling scalable and interpretable biomarkers for precision medicine.
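The abstract does not spell out the form of the convex clustering penalty. As a minimal sketch, assuming the commonly used sum-of-norms form (the sum over pairs of observations of a weight times the Euclidean distance between their embedded coordinates), the penalty could be evaluated as follows. The function name, the uniform default weights, and the toy coordinates are illustrative assumptions, not the authors' implementation.

```python
import math
from itertools import combinations


def convex_clustering_penalty(points, weights=None):
    """Sum over pairs i < j of w_ij * ||u_i - u_j||_2.

    `points` is a list of embedded observations (equal-length tuples).
    `weights` is an optional square matrix of pairwise weights;
    uniform weights of 1.0 are assumed when it is omitted.
    """
    total = 0.0
    for i, j in combinations(range(len(points)), 2):
        w = 1.0 if weights is None else weights[i][j]
        total += w * math.dist(points[i], points[j])
    return total


# Hypothetical 2-D embeddings of four observations in two groups.
separate = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
# The same observations with each group fused onto a shared centroid.
fused = [(0.0, 0.5), (0.0, 0.5), (3.0, 0.5), (3.0, 0.5)]

penalty_separate = convex_clustering_penalty(separate)
penalty_fused = convex_clustering_penalty(fused)
```

Under this assumed form, fusing each group onto a shared point lowers the penalty, because within-group distances drop to zero. Adding such a term, scaled by a regularization strength, to an embedding objective would therefore favor clustered embeddings, and sweeping that strength is one way a hierarchy of merges (a dendrogram) could arise without fixing the number of clusters.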