As datasets used in scientific applications become more complex, studying the geometry and topology of data has become an increasingly prevalent part of the data analysis process, as seen, for example, in the growing interest in topological tools such as persistent homology. However, topological tools are inherently limited to providing only coarse information about the underlying space of the data, while more geometric approaches rely predominantly on the manifold hypothesis, which asserts that the underlying space is a smooth manifold. This assumption fails for many physical models where the underlying space contains singularities. In this paper we develop a machine learning pipeline that captures fine-grained geometric information without relying on any smoothness assumptions. Our approach works within the scope of algebraic geometry and algebraic varieties instead of differential geometry and smooth manifolds. Under the variety hypothesis, which asserts that the underlying space is an algebraic variety, the learning problem becomes that of finding the underlying variety from sample data. We cast this learning problem as a Maximum A Posteriori optimization problem, which we solve via an eigenvalue computation. Having found the underlying variety, we explore the use of Gröbner bases and numerical methods to reveal information about its geometry. In particular, we propose a heuristic for numerically detecting points lying near the singular locus of the underlying variety.
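To make the idea of recovering a variety through an eigenvalue computation concrete, here is a minimal sketch of one standard technique. It is not the paper's pipeline. It fits a single polynomial to planar samples by taking the eigenvector of the smallest eigenvalue of the monomial moment matrix, which amounts to least squares under a unit-norm constraint with no prior, so the Maximum A Posteriori formulation is not reproduced. It then flags likely singular points by a small gradient norm, which is an assumed stand-in for the paper's heuristic. The function names, the test curve (the nodal cubic y^2 = x^2(x + 1), singular at the origin), and the degree are all illustrative choices.

```python
import numpy as np


def monomial_exponents(degree):
    """Exponents (i, j) of all monomials x**i * y**j with i + j <= degree."""
    return [(i, d - i) for d in range(degree + 1) for i in range(d + 1)]


def monomial_features(points, exponents):
    """Evaluate every monomial at every point; one row per point."""
    x, y = points[:, 0], points[:, 1]
    return np.stack([x**i * y**j for i, j in exponents], axis=1)


def fit_polynomial(points, degree):
    """Unit-norm coefficients minimising the mean squared value on the samples.

    Assumption: plain least squares with a unit-norm constraint (no prior).
    """
    exponents = monomial_exponents(degree)
    features = monomial_features(points, exponents)
    moment = features.T @ features / len(points)
    # eigh returns eigenvalues in ascending order, so column 0 is the minimiser.
    _, eigenvectors = np.linalg.eigh(moment)
    return exponents, eigenvectors[:, 0]


def gradient_norms(points, exponents, coeffs):
    """Norm of the gradient of the fitted polynomial at each sample point."""
    x, y = points[:, 0], points[:, 1]
    fx = np.zeros(len(points))
    fy = np.zeros(len(points))
    for (i, j), c in zip(exponents, coeffs):
        if i > 0:
            fx += c * i * x ** (i - 1) * y**j
        if j > 0:
            fy += c * j * x**i * y ** (j - 1)
    return np.hypot(fx, fy)


def sample_nodal_cubic(n):
    """Noise-free samples of y^2 = x^2 (x + 1) via x = t^2 - 1, y = t (t^2 - 1)."""
    t = np.linspace(-1.5, 1.5, n)
    return np.stack([t**2 - 1, t * (t**2 - 1)], axis=1)


points = sample_nodal_cubic(201)
exponents, coeffs = fit_polynomial(points, degree=3)
norms = gradient_norms(points, exponents, coeffs)
# The sample with the smallest gradient norm is the candidate near-singular point.
print("candidate near-singular sample:", points[np.argmin(norms)])
```

The gradient test rests on the fact that a singular point of a hypersurface is a point of the variety where the gradient of the defining polynomial vanishes. With noisy data or several defining equations, both the fit and the threshold on the gradient norm would need more care than this sketch provides.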