Manifold learning is a central task in modern statistics and data science. Many datasets (cells, documents, images, molecules) can be represented as point clouds embedded in a high-dimensional ambient space, yet the intrinsic degrees of freedom of the data are usually far fewer than the number of ambient dimensions. Detecting a latent manifold along which the data are embedded is a prerequisite for a wide family of downstream analyses. Real-world datasets are subject to noisy observations and sampling, so distilling information about the underlying manifold is a major challenge. We propose a method for manifold learning that uses a symmetric version of optimal transport with quadratic regularisation to construct a sparse and adaptive affinity matrix, which can be interpreted as a generalisation of the bistochastic kernel normalisation. We prove that the resulting kernel is consistent with a Laplace-type operator in the continuous limit, establish robustness to heteroskedastic noise, and exhibit these results in simulations. We identify a highly efficient computational scheme for computing this optimal transport for discrete data and demonstrate that it outperforms competing methods in a set of examples.
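To make the construction of the affinity matrix concrete, the following is a minimal sketch under stated assumptions, not the computational scheme referred to in the abstract. It assumes one common formulation of symmetric quadratically regularised optimal transport: minimise <C, W> + (eps/2)||W||_F^2 over symmetric, non-negative W with unit row sums, where C holds squared Euclidean distances. Under that assumption the solution has the form W_ij = max(u_i + u_j - C_ij, 0) / eps, and the sketch finds the potential u with a generic damped Newton iteration on the concave dual. The function name `symmetric_quadratic_ot`, the parameter values, and the noisy-circle test data are all hypothetical choices made for illustration.

```python
import numpy as np

def symmetric_quadratic_ot(C, eps, max_iter=100, tol=1e-9):
    """Hypothetical helper: W = max(u_i + u_j - C_ij, 0) / eps with unit row sums.

    Assumes C is symmetric with a zero diagonal. Uses a generic damped Newton
    ascent on the dual; this is an illustrative solver, not the paper's scheme.
    """
    n = C.shape[0]

    def dual(u):
        # Residuals r_ij = max(u_i + u_j - C_ij, 0) and the concave dual objective.
        R = np.maximum(u[:, None] + u[None, :] - C, 0.0)
        return 2.0 * u.sum() - (R ** 2).sum() / (2.0 * eps), R

    u = np.full(n, eps / 2.0)  # starting point: every diagonal entry of W equals 1
    val, R = dual(u)
    for _ in range(max_iter):
        # Dual gradient; it vanishes exactly when every row of W sums to 1.
        grad = 2.0 * (1.0 - R.sum(axis=1) / eps)
        if np.abs(grad).max() < tol:
            break
        # Generalised negative Hessian built from the active set, plus a tiny ridge.
        A = (R > 0).astype(float)
        H = (2.0 / eps) * (np.diag(A.sum(axis=1)) + A + 1e-10 * np.eye(n))
        step = np.linalg.solve(H, grad)
        # Backtracking (Armijo) line search with a small slack for rounding error.
        gs = grad @ step
        slack = 1e-12 * (1.0 + abs(val))
        t = 1.0
        new_val, new_R = dual(u + step)
        while new_val < val + 1e-4 * t * gs - slack and t > 1e-12:
            t *= 0.5
            new_val, new_R = dual(u + t * step)
        u, val, R = u + t * step, new_val, new_R
    return R / eps

# Illustrative data: noisy samples from a unit circle embedded in the plane.
rng = np.random.default_rng(0)
theta = np.sort(rng.uniform(0.0, 2.0 * np.pi, size=80))
X = np.column_stack([np.cos(theta), np.sin(theta)]) + 0.02 * rng.normal(size=(80, 2))

# Pairwise squared Euclidean distances as the transport cost.
C = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

W = symmetric_quadratic_ot(C, eps=0.5)
print("max row-sum error:", np.abs(W.sum(axis=1) - 1.0).max())
print("fraction of zero entries:", (W == 0).mean())
```

In this formulation, entries with u_i + u_j <= C_ij are exactly zero, so the support of each row is determined by the data rather than by a fixed neighbourhood size or bandwidth. This is the sense in which the sketch produces a sparse, adaptive, bistochastic affinity matrix.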