Bi-stochastic normalization provides an alternative normalization of graph Laplacians in graph-based data analysis and can be computed efficiently by Sinkhorn-Knopp (SK) iterations. This paper proves the convergence of the bi-stochastically normalized graph Laplacian to the manifold (weighted) Laplacian, with rates, when $n$ data points are sampled i.i.d. from a general $d$-dimensional manifold embedded in a possibly high-dimensional space. Under a certain joint limit of $n \to \infty$ and kernel bandwidth $\epsilon \to 0$, the point-wise convergence rate of the graph Laplacian operator (under 2-norm) is proved to be $O(n^{-1/(d/2+3)})$ at finite large $n$, up to log factors, achieved at the scaling $\epsilon \sim n^{-1/(d/2+3)}$. When the manifold data are corrupted by outlier noise, we prove point-wise consistency of the graph Laplacian at a rate that matches the rate for clean manifold data plus an additional term proportional to the bound on the inner products of the noise vectors among themselves and with the data vectors. Our analysis suggests that an approximate, rather than exact, bi-stochastic normalization achieves the same consistency rate. Motivated by this, we propose an approximate and constrained matrix scaling problem that can be solved by SK iterations with early termination. Numerical experiments support our theoretical results and show the robustness of the bi-stochastically normalized graph Laplacian to high-dimensional outlier noise.
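As an illustration of SK iterations with early termination, the following is a minimal sketch and not the paper's algorithm. It assumes a dense Gaussian kernel of the form $K_{ij} = \exp(-\|x_i - x_j\|^2/\epsilon)$, a stopping rule based on the maximum row-sum deviation, and a symmetrization $\eta = \sqrt{u \odot v}$ of the two SK scaling vectors; the function names `gaussian_kernel` and `sinkhorn_symmetric`, the tolerance `tol`, and the toy data on a circle are all hypothetical choices made for this example. The constraints on the scaling factors mentioned in the abstract are not modeled here.

```python
import numpy as np


def gaussian_kernel(X, eps):
    """Dense affinity K_ij = exp(-||x_i - x_j||^2 / eps) (assumed kernel form)."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from round-off
    return np.exp(-sq_dists / eps)


def sinkhorn_symmetric(K, tol=1e-3, max_iter=1000):
    """Approximate symmetric scaling of K by Sinkhorn-Knopp with early stopping.

    Returns a positive vector eta such that W = diag(eta) K diag(eta) has row
    sums within `tol` of 1 (if the tolerance is reached before `max_iter`),
    together with the final row-sum error and the number of iterations used.
    """
    n = K.shape[0]
    u = np.ones(n)
    v = np.ones(n)
    for it in range(1, max_iter + 1):
        u = 1.0 / (K @ v)      # rescale rows
        v = 1.0 / (K.T @ u)    # rescale columns
        eta = np.sqrt(u * v)   # symmetric scaling built from the two factors
        err = np.max(np.abs(eta * (K @ eta) - 1.0))
        if err < tol:          # early termination: approximate bi-stochasticity
            break
    return eta, err, it


# Toy usage: points sampled on a unit circle (a 1-dimensional manifold in R^2).
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=100)
X = np.column_stack([np.cos(theta), np.sin(theta)])

K = gaussian_kernel(X, eps=0.5)
eta, err, n_iter = sinkhorn_symmetric(K, tol=1e-6)
W = eta[:, None] * K * eta[None, :]  # approximately bi-stochastic affinity
print(n_iter, err)
```

A looser `tol` stops the loop earlier and yields a matrix `W` that is only approximately bi-stochastic, which is the regime the abstract says is sufficient for the consistency rate. A graph Laplacian could then be formed from `W`, for example as `(W - np.eye(len(W))) / eps` up to a constant factor; that normalization is also an assumption of this sketch.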