Scalable manifold learning by uniform landmark sampling and constrained locally linear embedding

As a pivotal approach in machine learning and data science, manifold learning aims to uncover the intrinsic low-dimensional structure of complex nonlinear manifolds in high-dimensional space. By exploiting the manifold hypothesis, various techniques for nonlinear dimension reduction have been developed to facilitate visualization, classification, clustering, and the extraction of key insights. Although existing manifold learning methods have achieved remarkable successes, they still distort the global structure extensively, which hinders the understanding of underlying patterns. Scalability issues also limit their applicability to large-scale data. Here, we propose a scalable manifold learning (scML) method that can handle large-scale, high-dimensional data efficiently. It starts by seeking a set of landmarks to construct the low-dimensional skeleton of the entire dataset, and then incorporates the non-landmarks into the learned space using constrained locally linear embedding (CLLE). We empirically validated the effectiveness of scML on synthetic datasets and real-world benchmarks of different types, and applied it to analyze single-cell transcriptomics data and to detect anomalies in electrocardiogram (ECG) signals. scML scales well with increasing data sizes and embedding dimensions, and exhibits promising performance in preserving the global structure. In the experiments, embedding quality remains notably robust as the sampling rate decreases.
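The two-stage idea described in the abstract (embed a set of landmarks first, then place each non-landmark relative to nearby landmarks) can be illustrated with a minimal sketch. This is not the authors' scML implementation: the landmark embedding here uses classical MDS as a stand-in, and the non-landmark step uses the standard LLE sum-to-one reconstruction weights rather than the paper's CLLE, whose exact constraints the abstract does not specify. All function names and parameters (`sample_landmarks`, `classical_mds`, `lle_weights`, `embed_with_landmarks`, `reg`, `k`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def sample_landmarks(n_points, n_landmarks, rng):
    """Pick landmark indices uniformly at random, without replacement."""
    return rng.choice(n_points, size=n_landmarks, replace=False)


def classical_mds(X, n_components):
    """Embed X with classical MDS (an assumed stand-in for the landmark step)."""
    sq = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances, clipped to guard against round-off.
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    n = X.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ D2 @ J                    # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))


def lle_weights(x, neighbors, reg=1e-3):
    """Weights reconstructing x from its neighbors, constrained to sum to 1."""
    Z = neighbors - x                        # shape (k, d)
    G = Z @ Z.T                              # local Gram matrix, shape (k, k)
    k = G.shape[0]
    trace = np.trace(G)
    # Regularize so the linear system stays well conditioned.
    G = G + reg * (trace if trace > 0 else 1.0) * np.eye(k)
    w = np.linalg.solve(G, np.ones(k))
    return w / w.sum()


def embed_with_landmarks(X, n_landmarks, n_components=2, k=10, seed=0):
    """Embed landmarks, then place every other point from nearby landmarks."""
    rng = np.random.default_rng(seed)
    idx = sample_landmarks(X.shape[0], n_landmarks, rng)
    L = X[idx]
    Y_land = classical_mds(L, n_components)

    Y = np.empty((X.shape[0], n_components))
    Y[idx] = Y_land
    rest = np.setdiff1d(np.arange(X.shape[0]), idx)
    for i in rest:
        d2 = np.sum((L - X[i]) ** 2, axis=1)
        nn = np.argsort(d2)[:k]              # k nearest landmarks
        w = lle_weights(X[i], L[nn])
        Y[i] = w @ Y_land[nn]                # same weights, low-dim landmarks
    return Y, idx
```

Calling `embed_with_landmarks(X, n_landmarks=200)` on an `(n, d)` array returns an `(n, 2)` embedding together with the landmark indices. Only the landmark step needs an eigendecomposition, so its cost grows with the number of landmarks rather than with the full dataset size, which is the intuition behind landmark-based scalability.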


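Manifold learning

Manifold learning, in full "manifold learning methods", has been a research focus in information science since it was first proposed in the journal Science in 2000. It is significant in both theory and application. Assuming the data are uniformly sampled from a low-dimensional manifold in a high-dimensional Euclidean space, manifold learning recovers the low-dimensional manifold structure from the high-dimensional samples: it finds the low-dimensional manifold within the high-dimensional space and computes the corresponding embedding map, in order to achieve dimensionality reduction or data visualization. In this sense it looks for the essence behind observed phenomena, that is, the underlying regularities that generate the data.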
