As datasets used in scientific applications become more complex, studying the geometry and topology of data has become an increasingly prevalent part of the data analysis process. This can be seen, for example, in the growing interest in topological tools such as persistent homology. However, topological tools are inherently limited to providing only coarse information about the underlying space of the data, while more geometric approaches rely predominantly on the manifold hypothesis, which asserts that the underlying space is a smooth manifold. This assumption fails for many physical models where the underlying space contains singularities. In this paper we develop a machine learning pipeline that captures fine-grained geometric information without relying on any smoothness assumptions. Our approach works within the scope of algebraic geometry and algebraic varieties instead of differential geometry and smooth manifolds. In the setting of the variety hypothesis, the learning problem becomes that of finding the underlying variety from sample data. We cast this learning problem as a maximum a posteriori (MAP) optimization problem, which we solve by means of an eigenvalue computation. Having found the underlying variety, we explore the use of Gröbner bases and numerical methods to reveal information about its geometry. In particular, we propose a heuristic for numerically detecting points lying near the singular locus of the underlying variety.
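To make the phrase "solve by means of an eigenvalue computation" concrete, the following is a minimal sketch and not the pipeline of the paper. It assumes the simplest possible case: the variety is a hypersurface cut out by a single polynomial of known degree, the samples are noise-free, and the MAP objective is replaced by a plain least-squares algebraic fit (the eigenvector of the monomial moment matrix with the smallest eigenvalue). The function names, the nodal cubic test curve, and the gradient-norm threshold used to flag points near the singular locus are all illustrative assumptions.

```python
import itertools
import numpy as np

def monomial_exponents(n_vars, degree):
    """All exponent tuples of total degree <= degree."""
    return [e for e in itertools.product(range(degree + 1), repeat=n_vars)
            if sum(e) <= degree]

def evaluate_monomials(points, exponents):
    """Matrix whose (i, j) entry is the j-th monomial evaluated at the i-th point."""
    pts = np.asarray(points, dtype=float)
    return np.stack([np.prod(pts ** np.array(e), axis=1) for e in exponents], axis=1)

def fit_hypersurface(points, degree):
    """Least-squares algebraic fit (a stand-in for the MAP step, by assumption).

    Returns the exponent list and the unit coefficient vector minimising the
    mean squared value of the polynomial over the sample points.
    """
    pts = np.asarray(points, dtype=float)
    exponents = monomial_exponents(pts.shape[1], degree)
    V = evaluate_monomials(pts, exponents)
    moment = V.T @ V / len(pts)
    # eigh returns eigenvalues in ascending order, so column 0 is the minimiser.
    _, eigvecs = np.linalg.eigh(moment)
    return exponents, eigvecs[:, 0]

def gradient_norm(points, exponents, coeffs):
    """Norm of the gradient of the fitted polynomial at each point."""
    pts = np.asarray(points, dtype=float)
    grads = np.zeros_like(pts)
    for c, e in zip(coeffs, exponents):
        for i in range(pts.shape[1]):
            if e[i] == 0:
                continue  # this monomial does not depend on variable i
            d = list(e)
            d[i] -= 1
            grads[:, i] += c * e[i] * np.prod(pts ** np.array(d), axis=1)
    return np.linalg.norm(grads, axis=1)

# Hypothetical test data: the nodal cubic y^2 = x^2 (x + 1), singular at the origin.
t = np.linspace(-1.5, 1.5, 200)
samples = np.column_stack([t**2 - 1, t * (t**2 - 1)])

exps, coeffs = fit_hypersurface(samples, degree=3)
norms = gradient_norm(samples, exps, coeffs)
# Illustrative heuristic: a small gradient suggests proximity to a singular point.
near_singular = samples[norms < 0.1]  # the threshold 0.1 is an arbitrary assumption
```

On a singular point of a hypersurface the defining polynomial and all of its partial derivatives vanish, which is why a small gradient norm is used here as a rough indicator. With noisy data, a variety of higher codimension, or a genuine prior on the coefficients, both the fit and the threshold would need to be adapted.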