How Over-Parameterization Slows Down Gradient Descent in Matrix Sensing: The Curses of Symmetry and Initialization

This paper rigorously shows how over-parameterization changes the convergence behaviors of gradient descent (GD) for the matrix sensing problem, where the goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements. First, we consider the symmetric setting with the symmetric parameterization where $M^* \in \mathbb{R}^{n \times n}$ is a positive semi-definite unknown matrix of rank $r \ll n$, and one uses a symmetric parameterization $XX^\top$ to learn $M^*$. Here $X \in \mathbb{R}^{n \times k}$ with $k > r$ is the factor matrix. We give a novel $\Omega (1/T^2)$ lower bound of randomly initialized GD for the over-parameterized case ($k >r$) where $T$ is the number of iterations. This is in stark contrast to the exact-parameterization scenario ($k=r$) where the convergence rate is $\exp (-\Omega (T))$. Next, we study asymmetric setting where $M^* \in \mathbb{R}^{n_1 \times n_2}$ is the unknown matrix of rank $r \ll \min\{n_1,n_2\}$, and one uses an asymmetric parameterization $FG^\top$ to learn $M^*$ where $F \in \mathbb{R}^{n_1 \times k}$ and $G \in \mathbb{R}^{n_2 \times k}$. Building on prior work, we give a global exact convergence result of randomly initialized GD for the exact-parameterization case ($k=r$) with an $\exp (-\Omega(T))$ rate. Furthermore, we give the first global exact convergence result for the over-parameterization case ($k>r$) with an $\exp(-\Omega(\alpha^2 T))$ rate where $\alpha$ is the initialization scale. This linear convergence result in the over-parameterization case is especially significant because one can apply the asymmetric parameterization to the symmetric setting to speed up from $\Omega (1/T^2)$ to linear convergence. On the other hand, we propose a novel method that only modifies one step of GD and obtains a convergence rate independent of $\alpha$, recovering the rate in the exact-parameterization case.

翻译：摘要：本文严格论证了过参数化如何改变梯度下降（GD）在矩阵感知问题中的收敛行为，该问题旨在从近各向同性线性测量中恢复未知的低秩真实矩阵。首先，我们考虑对称参数化下的对称设定，其中 $M^* \in \mathbb{R}^{n \times n}$ 是秩为 $r \ll n$ 的半正定未知矩阵，并采用对称参数化 $XX^\top$ 来学习 $M^*$，这里 $X \in \mathbb{R}^{n \times k}$（$k > r$）为因子矩阵。我们针对过参数化情形（$k > r$）给出了随机初始化GD的一个新颖的下界 $\Omega (1/T^2)$，其中 $T$ 为迭代次数。这与精确参数化情形（$k=r$）下收敛速度为 $\exp (-\Omega (T))$ 形成鲜明对比。其次，我们研究非对称设定，其中 $M^* \in \mathbb{R}^{n_1 \times n_2}$ 为秩 $r \ll \min\{n_1,n_2\}$ 的未知矩阵，采用非对称参数化 $FG^\top$ 学习 $M^*$，其中 $F \in \mathbb{R}^{n_1 \times k}$ 和 $G \in \mathbb{R}^{n_2 \times k}$。基于已有工作，我们给出了精确参数化情形（$k=r$）下随机初始化GD的全局精确收敛结果，收敛速度为 $\exp (-\Omega(T))$。进一步，我们首次给出了过参数化情形（$k>r$）下收敛速度为 $\exp(-\Omega(\alpha^2 T))$ 的全局精确收敛结果，其中 $\alpha$ 为初始化尺度。过参数化情形下的线性收敛结果尤为重要，因为可将非对称参数化应用于对称设定，将收敛速度从 $\Omega (1/T^2)$ 提升至线性收敛。另一方面，我们提出了一种仅需修改GD单步操作的新方法，其收敛速度与 $\alpha$ 无关，恢复至精确参数化情形的收敛速率。