As a pivotal approach in machine learning and data science, manifold learning aims to uncover the intrinsic low-dimensional structure of data lying on complex nonlinear manifolds in high-dimensional space. Building on the manifold hypothesis, various nonlinear dimension reduction techniques have been developed to facilitate visualization, classification, clustering, and the extraction of key insights. Although existing manifold learning methods have achieved remarkable successes, they still introduce extensive distortions in the global structure, which hinders the understanding of underlying patterns. Scalability issues also limit their applicability to large-scale data. Here, we propose a scalable manifold learning (scML) method that can handle large-scale, high-dimensional data efficiently. It first seeks a set of landmarks to construct a low-dimensional skeleton of the entire dataset, and then incorporates the non-landmarks into the learned space using constrained locally linear embedding (CLLE). We empirically validate the effectiveness of scML on synthetic datasets and real-world benchmarks of different types, and apply it to analyze single-cell transcriptomics data and to detect anomalies in electrocardiogram (ECG) signals. scML scales well with increasing data size and embedding dimension, and exhibits promising performance in preserving the global structure. The experiments also show that embedding quality remains notably robust as the sampling rate decreases.
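The two-stage idea described in the abstract (embed a set of landmarks first, then place the remaining points relative to them) can be illustrated with a small sketch. This is not the scML implementation: the abstract does not say how landmarks are selected, how the skeleton is embedded, or what constraint CLLE imposes. The sketch below therefore assumes uniform random landmark sampling, a PCA embedding of the landmarks, and standard sum-to-one LLE reconstruction weights as a stand-in for CLLE. All function names and parameters (`pca_embed`, `lle_weights`, `landmark_embed`, `reg`, `k`) are hypothetical.

```python
import numpy as np

def pca_embed(X, dim):
    """Project X onto its top `dim` principal components (stand-in for the landmark embedding)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T

def lle_weights(x, neighbors, reg=1e-3):
    """Standard LLE weights: minimize ||x - sum_j w_j n_j||^2 subject to sum_j w_j = 1."""
    Z = neighbors - x                      # neighbors shifted so x is the origin, shape (k, D)
    G = Z @ Z.T                            # local Gram matrix, shape (k, k)
    k = len(neighbors)
    trace = np.trace(G)
    # Regularize so the system stays well conditioned when k > D or neighbors are collinear.
    G = G + np.eye(k) * reg * (trace if trace > 0 else 1.0)
    w = np.linalg.solve(G, np.ones(k))
    return w / w.sum()                     # enforce the sum-to-one constraint

def landmark_embed(X, n_landmarks, dim=2, k=5, seed=0):
    """Embed landmarks first, then place every other point from its k nearest landmarks."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Assumption: landmarks are drawn uniformly at random.
    idx = rng.choice(n, size=n_landmarks, replace=False)
    L = X[idx]
    Y_L = pca_embed(L, dim)                # low-dimensional "skeleton"
    Y = np.empty((n, dim))
    Y[idx] = Y_L
    for i in np.setdiff1d(np.arange(n), idx):
        dist = np.linalg.norm(L - X[i], axis=1)
        nn = np.argsort(dist)[:k]          # k nearest landmarks in the input space
        w = lle_weights(X[i], L[nn])
        Y[i] = w @ Y_L[nn]                 # same weights, applied in the embedded space
    return Y, idx

if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(200, 10))
    Y, idx = landmark_embed(X, n_landmarks=40)
    print(Y.shape)
```

In this sketch only the landmarks go through the embedding step. Each remaining point costs one nearest-landmark search and one small k-by-k linear solve, which is the kind of structure that lets a landmark-based scheme scale to large datasets.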