Data sets tend to live in low-dimensional non-linear subspaces. Ideal data analysis tools for such data sets should therefore account for this non-linear geometry. The symmetric Riemannian geometry setting can be suitable for two reasons. First, it comes with a rich mathematical structure that accounts for a wide range of non-linear geometries, and empirical evidence from classical non-linear embedding shows that this structure can capture the data geometry. Second, many standard data analysis tools initially developed for data in Euclidean space can be generalised efficiently to data on a symmetric Riemannian manifold. A conceptual challenge comes from the lack of guidelines for constructing a symmetric Riemannian structure on the data space itself, and from the lack of guidelines for adapting successful data analysis algorithms on symmetric Riemannian manifolds to this setting. This work considers these challenges in the setting of pullback Riemannian geometry through a diffeomorphism. The first part of the paper characterises diffeomorphisms that result in proper, stable and efficient data analysis. The second part then uses these characterisations as best practices to guide the construction of such diffeomorphisms through deep learning. As a proof of concept, several types of pullback geometries, including the proposed construction, are tested on several data analysis tasks and several toy data sets. The numerical experiments confirm the predictions from theory: for proper, stable and efficient data analysis, the diffeomorphism generating the pullback geometry needs to map the data manifold into a geodesic subspace of the pulled back Riemannian manifold while preserving local isometry around the data manifold, and pulling back positive curvature can be problematic in terms of stability.
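To make the notion of pullback geometry concrete, the following is a minimal sketch and not the construction proposed in the paper. It assumes a hand-written diffeomorphism `phi` of the plane (a hypothetical choice made for illustration) and pulls back the flat Euclidean geometry, so that distances and geodesics are computed in the image of `phi` and mapped back with `phi_inv`. The toy data manifold is the parabola x2 = x1^2, which `phi` maps onto a straight line, that is, a geodesic subspace of Euclidean space.

```python
import numpy as np

# Hypothetical diffeomorphism of R^2 (an assumption for illustration only):
# it flattens the parabola x2 = x1^2 onto the line z2 = 0.
def phi(x):
    x = np.asarray(x, dtype=float)
    return np.stack([x[..., 0], x[..., 1] - x[..., 0] ** 2], axis=-1)

# Exact inverse of phi.
def phi_inv(z):
    z = np.asarray(z, dtype=float)
    return np.stack([z[..., 0], z[..., 1] + z[..., 0] ** 2], axis=-1)

# Distance in the pullback of the Euclidean geometry: d(x, y) = ||phi(x) - phi(y)||.
def pullback_distance(x, y):
    return float(np.linalg.norm(phi(x) - phi(y)))

# Geodesic in the pullback geometry: a straight line in the image of phi,
# mapped back to the data space with phi_inv.
def pullback_geodesic(x, y, t):
    t = np.asarray(t, dtype=float)[..., None]
    return phi_inv((1.0 - t) * phi(x) + t * phi(y))

# Two points on the parabola and the geodesic between them.
x = np.array([-1.0, 1.0])
y = np.array([2.0, 4.0])
ts = np.linspace(0.0, 1.0, 5)
curve = pullback_geodesic(x, y, ts)
```

Because the image of the parabola is a straight line, every point of `curve` lies on the parabola again, which illustrates the geodesic-subspace requirement. This particular `phi` does not preserve arc length along the parabola, so it satisfies only one of the two conditions named in the abstract; local isometry would have to be enforced separately.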