We propose a kernel-spectral embedding algorithm for learning low-dimensional nonlinear structures from high-dimensional and noisy observations, where the datasets are assumed to be sampled from an intrinsically low-dimensional manifold and corrupted by high-dimensional noise. The algorithm employs an adaptive bandwidth selection procedure which does not rely on prior knowledge of the underlying manifold. The obtained low-dimensional embeddings can be further utilized for downstream purposes such as data visualization, clustering and prediction. Our method is theoretically justified and practically interpretable. Specifically, we establish the convergence of the final embeddings to their noiseless counterparts when the dimension and size of the samples are comparably large, and characterize the effect of the signal-to-noise ratio on the rate of convergence and phase transition. We also prove convergence of the embeddings to the eigenfunctions of an integral operator defined by the kernel map of some reproducing kernel Hilbert space capturing the underlying nonlinear structures. Numerical simulations and analysis of three real datasets show the superior empirical performance of the proposed method, compared to many existing methods, on learning various manifolds in diverse applications.
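As a rough illustration of the kind of pipeline the abstract describes, the following is a minimal sketch of a kernel-spectral embedding with a data-driven bandwidth. It is not the authors' algorithm. The Gaussian kernel, the choice of bandwidth as a quantile of the pairwise squared distances, the scaling of the eigenvectors by the square root of the sample size, and the names `kernel_spectral_embedding`, `n_components`, and `quantile` are all assumptions made here for illustration.

```python
import numpy as np


def kernel_spectral_embedding(X, n_components=2, quantile=0.5):
    """Hypothetical sketch: embed rows of X via leading eigenvectors of a kernel matrix.

    Assumptions (not taken from the abstract): Gaussian kernel, bandwidth set
    to a quantile of the pairwise squared distances, no centering step.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise squared Euclidean distances, clipped at zero to absorb round-off.
    sq_norms = np.sum(X ** 2, axis=1)
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    D2 = np.maximum(D2, 0.0)
    np.fill_diagonal(D2, 0.0)

    # Data-driven bandwidth: a quantile of the off-diagonal squared distances,
    # so no knowledge of the underlying manifold is needed (assumed rule).
    upper = np.triu_indices(n, k=1)
    h = np.quantile(D2[upper], quantile)

    # Gaussian kernel matrix, scaled by 1/n.
    K = np.exp(-D2 / h) / n

    # Symmetric eigendecomposition; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(K)
    order = np.argsort(eigvals)[::-1][:n_components]

    # Scale eigenvectors by sqrt(n) so each coordinate has unit mean square.
    embedding = eigvecs[:, order] * np.sqrt(n)
    return embedding, eigvals[order], h


if __name__ == "__main__":
    # Toy data: a circle embedded in 50 dimensions with additive Gaussian noise.
    rng = np.random.default_rng(0)
    t = rng.uniform(0.0, 2.0 * np.pi, size=200)
    X = np.zeros((200, 50))
    X[:, 0] = 3.0 * np.cos(t)
    X[:, 1] = 3.0 * np.sin(t)
    X += 0.1 * rng.standard_normal(X.shape)

    emb, vals, h = kernel_spectral_embedding(X, n_components=3)
    print(emb.shape, h)
```

The quantile rule is used here only because it needs no knowledge of the manifold. The paper's actual bandwidth selection procedure and embedding normalization may differ.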