How Over-Parameterization Slows Down Gradient Descent in Matrix Sensing: The Curses of Symmetry and Initialization

This paper rigorously shows how over-parameterization changes the convergence behaviors of gradient descent (GD) for the matrix sensing problem, where the goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements. First, we consider the symmetric setting with the symmetric parameterization where $M^* \in \mathbb{R}^{n \times n}$ is a positive semi-definite unknown matrix of rank $r \ll n$, and one uses a symmetric parameterization $XX^\top$ to learn $M^*$. Here $X \in \mathbb{R}^{n \times k}$ with $k > r$ is the factor matrix. We give a novel $\Omega (1/T^2)$ lower bound of randomly initialized GD for the over-parameterized case ($k >r$) where $T$ is the number of iterations. This is in stark contrast to the exact-parameterization scenario ($k=r$) where the convergence rate is $\exp (-\Omega (T))$. Next, we study asymmetric setting where $M^* \in \mathbb{R}^{n_1 \times n_2}$ is the unknown matrix of rank $r \ll \min\{n_1,n_2\}$, and one uses an asymmetric parameterization $FG^\top$ to learn $M^*$ where $F \in \mathbb{R}^{n_1 \times k}$ and $G \in \mathbb{R}^{n_2 \times k}$. Building on prior work, we give a global exact convergence result of randomly initialized GD for the exact-parameterization case ($k=r$) with an $\exp (-\Omega(T))$ rate. Furthermore, we give the first global exact convergence result for the over-parameterization case ($k>r$) with an $\exp(-\Omega(\alpha^2 T))$ rate where $\alpha$ is the initialization scale. This linear convergence result in the over-parameterization case is especially significant because one can apply the asymmetric parameterization to the symmetric setting to speed up from $\Omega (1/T^2)$ to linear convergence. On the other hand, we propose a novel method that only modifies one step of GD and obtains a convergence rate independent of $\alpha$, recovering the rate in the exact-parameterization case.

翻译：本文严格论证了过度参数化如何改变梯度下降（GD）在矩阵感知问题中的收敛行为，该问题的目标是从近各向同性线性测量中恢复未知的低秩真实矩阵。首先，我们考虑对称参数化下的对称情形：设$M^* \in \mathbb{R}^{n \times n}$为秩$r \ll n$的半正定未知矩阵，通过对称参数化$XX^\top$学习$M^*$，其中$X \in \mathbb{R}^{n \times k}$且$k > r$为因子矩阵。我们给出了随机初始化GD在过度参数化情形（$k>r$）中新颖的$\Omega (1/T^2)$下界（$T$为迭代次数）。这与精确参数化情形（$k=r$）下$\exp (-\Omega (T))$的收敛速率形成鲜明对比。其次，我们研究非对称情形：设$M^* \in \mathbb{R}^{n_1 \times n_2}$为秩$r \ll \min\{n_1,n_2\}$的未知矩阵，通过非对称参数化$FG^\top$学习$M^*$，其中$F \in \mathbb{R}^{n_1 \times k}$，$G \in \mathbb{R}^{n_2 \times k}$。基于先前工作，我们给出了精确参数化情形（$k=r$）下随机初始化GD的全局精确收敛结果，收敛速率为$\exp (-\Omega(T))$。进一步，我们首次建立了过度参数化情形（$k>r$）下以$\exp(-\Omega(\alpha^2 T))$速率（$\alpha$为初始化尺度）的全局精确收敛结果。该线性收敛结果在过度参数化情形中尤为重要，因为可将非对称参数化应用于对称情形，将收敛速度从$\Omega (1/T^2)$提升至线性收敛。另一方面，我们提出仅修改GD一步的新方法，获得与$\alpha$无关的收敛速率，恢复精确参数化情形的速率。