Manifold Fitting: An Invitation to Statistics

While classical statistics has addressed observations that are real numbers or elements of a real vector space, many statistical problems of high interest in the sciences now concern data that consist of more complex objects, taking values in spaces that are not naturally (Euclidean) vector spaces but still feature some geometric structure. Manifold fitting is a long-standing problem, and has finally been addressed in recent years by Fefferman et al. (2020, 2021a). We develop a method with a theoretical guarantee that fits a $d$-dimensional underlying manifold from noisy observations sampled in the ambient space $\mathbb{R}^D$. The new approach uses geometric structures to obtain the manifold estimator in the form of image sets via a two-step mapping approach. We prove that, under certain mild assumptions and with a sample size $N=\mathcal{O}(\sigma^{-(d+3)})$, these estimators are true $d$-dimensional smooth manifolds whose estimation error, as measured by the Hausdorff distance, is bounded by $\mathcal{O}(\sigma^2\log(1/\sigma))$ with high probability. Compared with the existing approaches proposed in Fefferman et al. (2018, 2021b); Genovese et al. (2014); Yao and Xia (2019), our method exhibits superior efficiency while attaining very low error rates with a significantly reduced sample size, which scales polynomially in $\sigma^{-1}$ and exponentially in $d$. Extensive simulations are performed to validate our theoretical results. Our findings are relevant to various fields involving high-dimensional data in statistics and machine learning. Furthermore, our method opens up new avenues for existing non-Euclidean statistical methods, in the sense that it has the potential to unify them for analyzing data on manifolds in the ambient space.
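To make the error metric concrete, the following is a minimal sketch of the Hausdorff distance between two finite point clouds, applied to a toy example. It is not the two-step mapping estimator described above: the unit-circle setup ($d=1$, $D=2$), the function names `hausdorff_distance` and `sample_noisy_circle`, the noise level, and the naive radial projection are all assumptions introduced only for illustration.

```python
import numpy as np


def hausdorff_distance(A, B):
    """Symmetric Hausdorff distance between two finite point sets (rows are points)."""
    # Pairwise Euclidean distances, shape (len(A), len(B)).
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # Directed distances: the farthest any point of one set is from the other set.
    a_to_b = dists.min(axis=1).max()
    b_to_a = dists.min(axis=0).max()
    return max(a_to_b, b_to_a)


def sample_noisy_circle(n, sigma, rng):
    """Hypothetical setup: points on the unit circle in R^2 plus Gaussian noise."""
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    clean = np.column_stack([np.cos(theta), np.sin(theta)])
    return clean + sigma * rng.normal(size=clean.shape)


rng = np.random.default_rng(0)
sigma = 0.05  # assumed noise level, chosen only for the demo
noisy = sample_noisy_circle(500, sigma, rng)

# Dense grid on the true circle, used as a finite stand-in for the manifold.
grid_theta = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
reference = np.column_stack([np.cos(grid_theta), np.sin(grid_theta)])

# Naive "estimator" for illustration only: radial projection onto the known circle.
projected = noisy / np.linalg.norm(noisy, axis=1, keepdims=True)

print("Hausdorff distance, noisy sample vs. circle:    ", hausdorff_distance(noisy, reference))
print("Hausdorff distance, projected sample vs. circle:", hausdorff_distance(projected, reference))
```

The projection step uses knowledge of the true manifold, which a real fitting method does not have; it is included only to show how the metric responds when points are moved closer to the manifold.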
