Consider a set of points sampled independently near a smooth compact submanifold of Euclidean space. We provide mathematically rigorous bounds on the number of sample points required to estimate both the dimension and the tangent spaces of that manifold with high confidence. The estimation algorithm is Local PCA, a local version of principal component analysis. Our results accommodate noisy, non-uniform data distributions in which the noise may vary across the manifold, and they allow simultaneous estimation at multiple points. Crucially, all of the constants appearing in our bounds are explicitly described. The proof uses a matrix concentration inequality to estimate covariance matrices and a Wasserstein distance bound to quantify the nonlinearity of the underlying manifold and the non-uniformity of the probability measure.
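To make the Local PCA procedure concrete, the following is a minimal sketch in Python with NumPy, not the construction analysed in the paper. It collects the sample points within a ball around a base point, forms the local covariance matrix, and reads off a dimension estimate and a tangent basis from its eigendecomposition. The function names, the `radius` parameter, and the largest-eigenvalue-ratio rule for choosing the dimension are illustrative assumptions. That rule in particular is a simple heuristic standing in for whatever explicit threshold a rigorous analysis would prescribe, and it cannot detect a manifold whose dimension equals the ambient dimension.

```python
import numpy as np


def sample_noisy_sphere(n, sigma, seed=0):
    """Hypothetical test data: n points uniform on the unit sphere in R^3,
    perturbed by isotropic Gaussian noise with standard deviation sigma."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, 3))
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # project onto the sphere
    return x + sigma * rng.standard_normal((n, 3))


def local_pca(points, center, radius):
    """Estimate the dimension and a tangent basis at `center`.

    Uses only the sample points within Euclidean distance `radius` of
    `center`. The dimension is picked at the largest ratio between
    consecutive eigenvalues (an illustrative heuristic, see lead-in).
    """
    dists = np.linalg.norm(points - center, axis=1)
    diffs = points[dists <= radius] - center
    if len(diffs) < 2:
        raise ValueError("too few sample points inside the ball")

    # Local covariance matrix, taken about the base point.
    cov = diffs.T @ diffs / len(diffs)

    # eigh returns eigenvalues in ascending order; reverse to descending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # Largest multiplicative gap between consecutive eigenvalues.
    ratios = eigvals[:-1] / np.maximum(eigvals[1:], 1e-300)
    dim = int(np.argmax(ratios)) + 1

    # Columns of the returned matrix span the estimated tangent space.
    return dim, eigvecs[:, :dim]


if __name__ == "__main__":
    pts = sample_noisy_sphere(5000, sigma=0.01)
    dim, basis = local_pca(pts, np.array([0.0, 0.0, 1.0]), radius=0.3)
    print("estimated dimension:", dim)
    print("tangent basis (columns):")
    print(basis)
```

In this sketch the radius plays the role of the locality scale: too small a ball leaves too few points for the covariance estimate to concentrate, while too large a ball lets curvature inflate the eigenvalues in the normal directions. The sample-size bounds described in the abstract quantify this trade-off.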