As a pivotal approach in machine learning and data science, manifold learning aims to uncover the intrinsic low-dimensional structure of complex nonlinear manifolds embedded in high-dimensional space. Building on the manifold hypothesis, various nonlinear dimension reduction techniques have been developed to support visualization, classification, clustering, and the extraction of key insights. Although existing manifold learning methods have achieved remarkable successes, they still substantially distort the global structure of the data, which hinders the understanding of underlying patterns. Scalability issues also limit their applicability to large-scale data. Here, we propose a scalable manifold learning (scML) method that handles large-scale, high-dimensional data efficiently. It first seeks a set of landmarks to construct a low-dimensional skeleton of the entire dataset and then incorporates the non-landmarks into the landmark space using constrained locally linear embedding (CLLE). We empirically validated the effectiveness of scML on synthetic datasets and real-world benchmarks of different types, and applied it to analyze single-cell transcriptomics data and to detect anomalies in electrocardiogram (ECG) signals. scML scales well with increasing data size and shows promising performance in preserving the global structure. In our experiments, embedding quality remains notably robust as the sampling rate decreases.
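The two-stage idea described above (embed a set of landmarks first, then place each remaining point relative to nearby landmarks) can be illustrated with a minimal sketch. This is not the scML implementation: the abstract does not specify how landmarks are chosen, how the skeleton is embedded, or what constraint CLLE imposes. The sketch assumes uniform random landmark sampling, a PCA embedding of the landmarks as a stand-in skeleton, and the standard sum-to-one LLE reconstruction weights in place of CLLE. The names `lle_weights` and `landmark_embed` and the parameters `n_landmarks`, `k`, and `reg` are hypothetical.

```python
import numpy as np


def lle_weights(x, neighbors, reg=1e-3):
    """Sum-to-one weights that best reconstruct x from its neighbors.

    Standard LLE weights with Tikhonov regularization; an assumption
    standing in for the paper's constrained variant (CLLE).
    """
    diffs = neighbors - x                    # (k, d) offsets from x
    gram = diffs @ diffs.T                   # (k, k) local Gram matrix
    trace = np.trace(gram)
    scale = trace if trace > 0 else 1.0
    gram = gram + np.eye(len(neighbors)) * reg * scale
    w = np.linalg.solve(gram, np.ones(len(neighbors)))
    return w / w.sum()                       # enforce sum-to-one


def landmark_embed(X, n_landmarks, n_components=2, k=5, seed=0):
    """Embed landmarks first, then map every other point onto them."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Assumption: landmarks are a uniform random sample of the data.
    landmark_idx = rng.choice(n, size=n_landmarks, replace=False)
    L = X[landmark_idx]

    # Assumption: PCA of the landmarks serves as the low-dimensional skeleton.
    L_centered = L - L.mean(axis=0)
    _, _, vt = np.linalg.svd(L_centered, full_matrices=False)
    Y_land = L_centered @ vt[:n_components].T

    Y = np.zeros((n, n_components))
    Y[landmark_idx] = Y_land

    is_landmark = np.zeros(n, dtype=bool)
    is_landmark[landmark_idx] = True
    for i in np.flatnonzero(~is_landmark):
        # k nearest landmarks in the original high-dimensional space
        d2 = np.sum((L - X[i]) ** 2, axis=1)
        nn = np.argsort(d2)[:k]
        # Reuse the high-dimensional weights in the landmark space.
        w = lle_weights(X[i], L[nn])
        Y[i] = w @ Y_land[nn]
    return Y, landmark_idx
```

In this sketch only the landmarks pass through the global embedding step, and each non-landmark costs one small `k` by `k` linear solve, which is where a landmark-based scheme of this kind gets its scalability. Lowering `n_landmarks` corresponds to lowering the sampling rate mentioned above.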