Many real-world datasets live on high-dimensional Stiefel and Grassmannian manifolds, $V_k(\mathbb{R}^N)$ and $Gr(k, \mathbb{R}^N)$ respectively, and benefit from projection onto lower-dimensional Stiefel (respectively, Grassmannian) manifolds. In this work, we propose an algorithm called Principal Stiefel Coordinates (PSC) that reduces data dimensionality from $V_k(\mathbb{R}^N)$ to $V_k(\mathbb{R}^n)$, where $k \leq n \ll N$, in an $O(k)$-equivariant manner. We begin by observing that each element $\alpha \in V_n(\mathbb{R}^N)$ defines an isometric embedding of $V_k(\mathbb{R}^n)$ into $V_k(\mathbb{R}^N)$. Next, we search for the embedding that minimizes the data-fit error, warm-starting from the output of principal component analysis (PCA) and refining it with gradient descent. Then, we define a continuous and $O(k)$-equivariant map $\pi_\alpha$ that acts as a ``closest point operator'': it projects the data onto the image of $V_k(\mathbb{R}^n)$ in $V_k(\mathbb{R}^N)$ under the embedding determined by $\alpha$, while minimizing distortion. Because this dimensionality reduction is $O(k)$-equivariant, these results extend to Grassmannian manifolds as well. Lastly, we show that the PCA output globally minimizes projection error in a noiseless setting, but that our algorithm achieves a meaningfully different and improved outcome when the data does not lie exactly on the image of a linearly embedded lower-dimensional Stiefel manifold as above. We report multiple numerical experiments on synthetic and real-world data.
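The pipeline described above can be illustrated with a minimal NumPy sketch. It rests on several assumptions that the abstract does not confirm: the embedding determined by $\alpha$ is taken to be $y \mapsto \alpha y$, the map $\pi_\alpha$ is stood in for by the polar factor of $\alpha^\top x$ (one natural choice of closest-point operator), and the PCA warm start is taken to be the top $n$ left singular vectors of the column-stacked data. The gradient-descent refinement is omitted, and all function names are hypothetical.

```python
import numpy as np

def random_stiefel(rows, cols, rng):
    """Sample a rows x cols matrix with orthonormal columns, i.e. a point of V_cols(R^rows)."""
    q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return q

def polar_factor(m):
    """Orthonormal factor U V^T of the polar decomposition of a full-rank matrix m."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def embed(alpha, y):
    """Assumed embedding V_k(R^n) -> V_k(R^N) determined by alpha: y -> alpha y."""
    return alpha @ y

def project(alpha, x):
    """Assumed stand-in for pi_alpha, returning coordinates in V_k(R^n)."""
    return polar_factor(alpha.T @ x)

def pca_init(data, n):
    """Assumed PCA warm start: top-n left singular vectors of the stacked data columns."""
    stacked = np.concatenate(data, axis=1)  # shape N x (m * k)
    u, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, :n]

rng = np.random.default_rng(0)
N, n, k, m = 50, 5, 2, 20

# A generic point x in V_k(R^N) and a frame alpha in V_n(R^N).
alpha = random_stiefel(N, n, rng)
x = random_stiefel(N, k, rng)
y = project(alpha, x)           # reduced coordinates in V_k(R^n)
g = random_stiefel(k, k, rng)   # an orthogonal k x k matrix, i.e. an element of O(k)

# The embedding preserves Frobenius distances because alpha^T alpha = I_n.
y1, y2 = random_stiefel(n, k, rng), random_stiefel(n, k, rng)
dist_low = np.linalg.norm(y1 - y2)
dist_high = np.linalg.norm(embed(alpha, y1) - embed(alpha, y2))

# Noiseless data lying exactly on the image of the embedding.
data = [embed(alpha, random_stiefel(n, k, rng)) for _ in range(m)]
alpha0 = pca_init(data, n)
max_err = max(np.linalg.norm(embed(alpha0, project(alpha0, xi)) - xi) for xi in data)
```

In this sketch, `project(alpha, x @ g)` agrees with `y @ g`, which is the $O(k)$-equivariance property, and `max_err` is zero up to floating-point error, consistent with the claim that the PCA output already attains the minimum projection error when the data is noiseless. With noisy data the PCA frame would serve only as the starting point for further optimization.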