We study the convergence rate of first-order methods for rectangular matrix factorization, a canonical nonconvex optimization problem. Specifically, given a rank-$r$ matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$, we prove that gradient descent (GD) finds, with high probability, a pair of $\epsilon$-optimal solutions $\mathbf{X}_T\in\mathbb{R}^{m\times d}$ and $\mathbf{Y}_T\in\mathbb{R}^{n\times d}$, where $d\geq r$, satisfying $\lVert\mathbf{X}_T\mathbf{Y}_T^\top-\mathbf{A}\rVert_\mathrm{F}\leq\epsilon\lVert\mathbf{A}\rVert_\mathrm{F}$ in $T=O(\kappa^2\log\frac{1}{\epsilon})$ iterations, where $\kappa$ denotes the condition number of $\mathbf{A}$. Furthermore, we prove that Nesterov's accelerated gradient (NAG) attains an iteration complexity of $O(\kappa\log\frac{1}{\epsilon})$, the best known bound among first-order methods for rectangular matrix factorization. Unlike the small, balanced random initialization used in the existing literature, we adopt an unbalanced initialization in which $\mathbf{X}_0$ is large and $\mathbf{Y}_0$ is $0$. Moreover, our initialization and analysis extend to linear neural networks, for which we prove that NAG also attains an accelerated linear convergence rate. In particular, we require only that the network width be at least the rank of the output label matrix. In contrast, previous results achieving the same rate require excessively large widths that additionally depend on the condition number and the rank of the input data matrix.
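To make the setup concrete, the following is a minimal NumPy sketch of GD and NAG applied to the objective $\frac{1}{2}\lVert\mathbf{X}\mathbf{Y}^\top-\mathbf{A}\rVert_\mathrm{F}^2$ with the unbalanced initialization described above ($\mathbf{X}_0$ large and random, $\mathbf{Y}_0=0$). The scale `alpha`, step size `eta`, momentum `beta`, and iteration counts here are illustrative choices, not the constants prescribed by the analysis.

```python
import numpy as np

# Sketch of GD and NAG for f(X, Y) = 0.5 * ||X Y^T - A||_F^2 with the
# unbalanced initialization: X_0 large and random, Y_0 = 0.
# alpha, eta, beta, and the iteration counts are illustrative only.

rng = np.random.default_rng(0)
m, n, r, d = 100, 80, 5, 10          # d >= r, as the abstract requires

# Build a rank-r target matrix A.
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

def grad(X, Y, A):
    """Gradients of 0.5 * ||X Y^T - A||_F^2 w.r.t. X and Y."""
    R = X @ Y.T - A                   # residual
    return R @ Y, R.T @ X

# Unbalanced initialization: large X_0, zero Y_0.
alpha = 10.0                          # "large" scale (illustrative)
X0 = alpha * rng.standard_normal((m, d))
Y0 = np.zeros((n, d))

eta = 0.5 / np.linalg.norm(X0, 2) ** 2  # step size tied to ||X_0||_2^2
beta = 0.9                               # NAG momentum (illustrative)

# Plain gradient descent.
Xg, Yg = X0.copy(), Y0.copy()
for _ in range(2000):
    gX, gY = grad(Xg, Yg, A)
    Xg, Yg = Xg - eta * gX, Yg - eta * gY

# Nesterov's accelerated gradient: gradient taken at an extrapolated point.
Xn, Yn = X0.copy(), Y0.copy()
Xp, Yp = Xn.copy(), Yn.copy()         # previous iterates
for _ in range(2000):
    Xe, Ye = Xn + beta * (Xn - Xp), Yn + beta * (Yn - Yp)
    gX, gY = grad(Xe, Ye, A)
    Xp, Yp = Xn, Yn
    Xn, Yn = Xe - eta * gX, Ye - eta * gY

rel = lambda X, Y: np.linalg.norm(X @ Y.T - A) / np.linalg.norm(A)
print(f"GD  relative error: {rel(Xg, Yg):.3e}")
print(f"NAG relative error: {rel(Xn, Yn):.3e}")
```

One way to see the role of the unbalanced initialization in this sketch: since $\mathbf{Y}_0=0$, the gradient with respect to $\mathbf{X}$ vanishes at the first step, so the large random $\mathbf{X}_0$ initially stays put while the $\mathbf{Y}$-updates are driven by $\mathbf{A}^\top\mathbf{X}_0$.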