The Gaussian kernel and its traditional normalizations (e.g., row-stochastic) are popular approaches for assessing similarities between data points. Yet, they can be inaccurate under high-dimensional noise, especially if the noise magnitude varies considerably across the data, e.g., under heteroskedasticity or in the presence of outliers. In this work, we investigate a more robust alternative: the doubly stochastic normalization of the Gaussian kernel. We consider a setting where points are sampled from an unknown density on a low-dimensional manifold embedded in a high-dimensional space and corrupted by possibly strong, non-identically distributed, sub-Gaussian noise. We establish that the doubly stochastic affinity matrix and its scaling factors concentrate around certain population forms, and provide corresponding finite-sample probabilistic error bounds. We then utilize these results to develop several tools for robust inference under general high-dimensional noise. First, we derive a robust density estimator that reliably infers the underlying sampling density and can substantially outperform the standard kernel density estimator under heteroskedasticity and outliers. Second, we obtain estimators for the pointwise noise magnitudes, the pointwise signal magnitudes, and the pairwise Euclidean distances between clean data points. Lastly, we derive robust graph Laplacian normalizations that accurately approximate various manifold Laplacians, including the Laplace-Beltrami operator, improving over traditional normalizations in noisy settings. We exemplify our results in simulations and on real single-cell RNA-sequencing data. For the latter, we show that in contrast to traditional methods, our approach is robust to variability in technical noise levels across cell types.
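To make the central object concrete, the following is a minimal sketch of a doubly stochastic scaling of a Gaussian kernel: it looks for positive factors d such that W = diag(d) K diag(d) has unit row and column sums. The abstract does not specify an algorithm, so the damped symmetric Sinkhorn-type iteration used here is an assumption, as are the function names, the bandwidth convention exp(-||x_i - x_j||^2 / eps), and the choice to zero the kernel diagonal.

```python
import numpy as np


def gaussian_kernel(X, eps, zero_diagonal=True):
    """Gaussian kernel K_ij = exp(-||x_i - x_j||^2 / eps).

    The bandwidth convention and the zeroed diagonal are assumptions
    made for this sketch, not details stated in the abstract.
    """
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negative round-off
    K = np.exp(-sq_dists / eps)
    if zero_diagonal:
        np.fill_diagonal(K, 0.0)
    return K


def doubly_stochastic_scaling(K, max_iter=10000, tol=1e-10):
    """Find d > 0 such that W = diag(d) K diag(d) is doubly stochastic.

    Uses the damped fixed-point update d <- sqrt(d / (K d)), a standard
    symmetric variant of Sinkhorn scaling (an illustrative choice).
    """
    d = 1.0 / np.sqrt(K.sum(axis=1))
    for _ in range(max_iter):
        row_sums = d * (K @ d)  # current row sums of diag(d) K diag(d)
        if np.max(np.abs(row_sums - 1.0)) < tol:
            break
        d = d / np.sqrt(row_sums)  # equals sqrt(d / (K d))
    W = d[:, None] * K * d[None, :]
    return W, d


# Usage: a noisy circle embedded in 50 dimensions, with a different
# (hypothetical) noise level for every point to mimic heteroskedasticity.
rng = np.random.default_rng(0)
n, dim = 30, 50
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
clean = np.zeros((n, dim))
clean[:, 0], clean[:, 1] = np.cos(theta), np.sin(theta)
noise_std = rng.uniform(0.01, 0.1, size=n)
noisy = clean + noise_std[:, None] * rng.standard_normal((n, dim))

K = gaussian_kernel(noisy, eps=1.0)
W, d = doubly_stochastic_scaling(K)
print("max |row sum - 1|:", np.max(np.abs(W.sum(axis=1) - 1.0)))
```

Because K is symmetric, a single vector of scaling factors suffices, and the resulting W is symmetric with both row and column sums equal to one. The sketch only illustrates the normalization itself; it does not implement the density, noise-magnitude, or Laplacian estimators described above.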