Scalable manifold learning by uniform landmark sampling and constrained locally linear embedding

As a pivotal approach in machine learning and data science, manifold learning aims to uncover the intrinsic low-dimensional structure of complex nonlinear manifolds in high-dimensional space. By exploiting the manifold hypothesis, various techniques for nonlinear dimension reduction have been developed to facilitate visualization, classification, clustering, and the extraction of key insights. Although existing manifold learning methods have achieved remarkable successes, they still suffer from extensive distortions of the global structure, which hinders the understanding of underlying patterns. Scalability issues also limit their applicability to large-scale data. Here, we propose a scalable manifold learning (scML) method that can handle large-scale and high-dimensional data efficiently. It starts by seeking a set of landmarks to construct the low-dimensional skeleton of the entire dataset and then incorporates the non-landmarks into the landmark space based on constrained locally linear embedding (CLLE). We empirically validated the effectiveness of scML on synthetic datasets and real-world benchmarks of different types, and applied it to analyze single-cell transcriptomics data and detect anomalies in electrocardiogram (ECG) signals. scML scales well with increasing data sizes and exhibits promising performance in preserving the global structure. The experiments demonstrate notable robustness in embedding quality as the sampling rate decreases.
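The two-stage idea in the abstract (embed a small set of landmarks first, then place every other point relative to nearby landmarks) can be illustrated with a minimal sketch. This is not the authors' scML implementation: the abstract does not specify how landmarks are sampled, how the skeleton is embedded, or what the CLLE constraints are. As stand-ins, the sketch assumes greedy farthest-point sampling for the landmarks, classical MDS for the landmark skeleton, and the standard sum-to-one LLE reconstruction weights for placing non-landmarks. All function names and parameters (`m`, `k`, `d`, `reg`) are hypothetical.

```python
import numpy as np

def farthest_point_landmarks(X, m, seed=0):
    """Pick m spread-out landmark indices by greedy farthest-point sampling
    (an assumed stand-in for the paper's uniform landmark sampling)."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]
    dist = np.linalg.norm(X - X[idx[0]], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(dist))          # point farthest from all landmarks so far
        idx.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(idx)

def classical_mds(X, d):
    """Embed points in d dimensions from pairwise Euclidean distances
    (an assumed stand-in for the landmark skeleton embedding)."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared distances
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    B = -0.5 * J @ D2 @ J
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:d]               # d largest eigenvalues
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))

def lle_weights(x, neighbors, reg=1e-3):
    """Weights summing to 1 that best reconstruct x from its neighbors
    (standard LLE constraint; the paper's CLLE may add others)."""
    Z = neighbors - x
    G = Z @ Z.T
    G += reg * (np.trace(G) + 1e-12) * np.eye(len(G))  # regularize the Gram matrix
    w = np.linalg.solve(G, np.ones(len(G)))
    return w / w.sum()

def landmark_embed(X, m=50, k=8, d=2, reg=1e-3, seed=0):
    """Embed landmarks, then place each non-landmark as a weighted
    combination of its k nearest landmarks' low-dimensional coordinates."""
    lm = farthest_point_landmarks(X, m, seed)
    Y_lm = classical_mds(X[lm], d)
    Y = np.empty((len(X), d))
    Y[lm] = Y_lm
    for i in np.setdiff1d(np.arange(len(X)), lm):
        nn = np.argsort(np.linalg.norm(X[lm] - X[i], axis=1))[:k]
        w = lle_weights(X[i], X[lm][nn], reg)
        Y[i] = w @ Y_lm[nn]
    return Y, lm
```

Calling `landmark_embed(X, m=40, k=6)` on an `(n, D)` array returns an `(n, 2)` embedding together with the landmark indices. Only the landmark step has cost that grows with `m` squared or worse; each remaining point needs just a small `k`-by-`k` linear solve, which is the property that makes landmark-based schemes attractive for large datasets.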

Related content

Manifold learning

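Manifold learning, or more fully manifold learning methods, has been a research hotspot in information science since it was first proposed in the journal Science in 2000. It is significant in both theory and application. Assuming the data are uniformly sampled from a low-dimensional manifold embedded in a high-dimensional Euclidean space, manifold learning recovers the low-dimensional manifold structure from the high-dimensional samples: it finds the low-dimensional manifold in the high-dimensional space and computes the corresponding embedding map, in order to achieve dimensionality reduction or data visualization. In this sense it looks past the observed phenomena to the essence of things, uncovering the underlying regularities that generate the data.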
