Principal Component Analysis (PCA) is a workhorse of modern data science. PCA assumes that the data conform to Euclidean geometry, but for some data types, such as hierarchical and cyclic data, other spaces are more appropriate. We study PCA in space forms, that is, spaces of constant curvature. At a point on a Riemannian manifold, a set of tangent vectors defines a Riemannian affine subspace. Finding the optimal low-dimensional affine subspace for given points in a space form amounts to dimensionality reduction. Our Space Form PCA (SFPCA) seeks the affine subspace that represents a set of manifold-valued points with minimum projection cost. We propose proper cost functions that enjoy two properties: (1) their optimal affine subspace is the solution to an eigenequation, and (2) optimal affine subspaces of different dimensions form a nested set. These properties improve on existing methods, most of which are iterative algorithms with slow convergence and weaker theoretical guarantees. We evaluate SFPCA on real and simulated data in spherical and hyperbolic spaces. On simulated data, it outperforms alternative methods in estimating the true subspaces, in convergence speed or accuracy and often in both.
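To illustrate the two properties in the spherical case, the following is a minimal sketch and not the authors' implementation. It assumes one particular cost, the mean squared sine of the geodesic distance from each point to a great subsphere; the abstract does not state which cost functions SFPCA uses. Under this assumed cost the optimal subsphere is spanned by the leading eigenvectors of the second-moment matrix, so the solution comes from a single eigendecomposition and the subspaces of increasing dimension are nested. The function names `spherical_pca_sketch` and `project_to_subsphere` are hypothetical, and the hyperbolic case is not covered.

```python
import numpy as np


def spherical_pca_sketch(X, k):
    """Fit a k-dimensional great subsphere to unit-norm rows of X.

    Assumed cost (not taken from the abstract): mean sin^2 of the geodesic
    distance to the subsphere, which equals the mean squared norm of the
    component of each point orthogonal to the subsphere's linear span.
    Returns an orthonormal basis V of that (k+1)-dimensional span and the cost.
    """
    C = X.T @ X / X.shape[0]              # second-moment matrix (no centering)
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # reorder to descending
    V = eigvecs[:, : k + 1]               # leading k+1 eigenvectors
    cost = eigvals[k + 1:].sum()          # discarded eigenvalues = mean sin^2
    return V, cost


def project_to_subsphere(X, V):
    """Project points onto the subsphere: linear projection, then renormalize.

    Assumes no point is exactly orthogonal to span(V).
    """
    P = X @ V @ V.T
    return P / np.linalg.norm(P, axis=1, keepdims=True)


# Simulated data: noisy points near a great circle on the 3-sphere in R^4.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=200)
X = np.zeros((200, 4))
X[:, 0], X[:, 1] = np.cos(theta), np.sin(theta)
X += 0.05 * rng.normal(size=X.shape)
X /= np.linalg.norm(X, axis=1, keepdims=True)

V1, cost1 = spherical_pca_sketch(X, k=1)  # best-fit great circle
V2, cost2 = spherical_pca_sketch(X, k=2)  # best-fit great 2-sphere
X_proj = project_to_subsphere(X, V1)
```

In this sketch the columns of `V1` are the first two columns of `V2`, which is the nesting property in concrete form, and the cost can only decrease as `k` grows. Any unit vector in the span of `V` can serve as the base point of the subsphere, with the remaining directions playing the role of tangent vectors.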