We consider alternating gradient descent (AGD) with fixed step size $\eta > 0$, applied to the asymmetric matrix factorization objective. We show that, for a rank-$r$ matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, $T = O\left( \left(\frac{\sigma_1(\mathbf{A})}{\sigma_r(\mathbf{A})}\right)^2 \log(1/\epsilon)\right)$ iterations of alternating gradient descent suffice to reach an $\epsilon$-optimal factorization $\| \mathbf{A} - \mathbf{X}_T^{\vphantom{\intercal}} \mathbf{Y}_T^{\intercal} \|_{\rm F}^2 \leq \epsilon \| \mathbf{A} \|_{\rm F}^2$ with high probability, starting from an atypical random initialization. The factors have rank $d>r$, so that $\mathbf{X}_T\in\mathbb{R}^{m \times d}$ and $\mathbf{Y}_T \in\mathbb{R}^{n \times d}$. Experiments suggest that our proposed initialization is not merely of theoretical benefit but also significantly improves the convergence of gradient descent in practice. Our proof is conceptually simple: a uniform Polyak-Łojasiewicz (PL) inequality and a uniform Lipschitz smoothness constant are guaranteed for a sufficient number of iterations, starting from our random initialization. Our proof method should be useful for extending and simplifying convergence analyses for a broader class of nonconvex low-rank factorization problems.
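To make the iteration concrete, the following is a minimal NumPy sketch of alternating gradient descent with a fixed step size $\eta$. It assumes the objective is normalized as $\frac{1}{2}\| \mathbf{A} - \mathbf{X}\mathbf{Y}^{\intercal} \|_{\rm F}^2$, and it uses a plain small Gaussian initialization as a stand-in: it does not reproduce the atypical random initialization analyzed here, whose exact form is not given in this abstract. The names `agd_factorize`, `relative_error`, and `init_scale` are hypothetical and chosen only for illustration.

```python
import numpy as np


def relative_error(A, X, Y):
    """Return ||A - X Y^T||_F^2 / ||A||_F^2, the quantity bounded by epsilon."""
    return np.linalg.norm(A - X @ Y.T, "fro") ** 2 / np.linalg.norm(A, "fro") ** 2


def agd_factorize(A, d, eta, num_iters, init_scale=1e-2, seed=0):
    """Alternating gradient descent on 0.5 * ||A - X Y^T||_F^2 (assumed normalization).

    The initialization below is a generic small Gaussian draw, NOT the
    paper's proposed initialization; it is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    X = init_scale * rng.standard_normal((m, d))
    Y = init_scale * rng.standard_normal((n, d))
    for _ in range(num_iters):
        # Gradient step in X with Y held fixed: grad_X = (X Y^T - A) Y.
        X = X - eta * (X @ Y.T - A) @ Y
        # Gradient step in Y using the freshly updated X: grad_Y = (X Y^T - A)^T X.
        Y = Y - eta * (X @ Y.T - A).T @ X
    return X, Y


if __name__ == "__main__":
    # Build a rank-2 test matrix with singular values 2 and 1 (illustrative data).
    rng = np.random.default_rng(1)
    U, _ = np.linalg.qr(rng.standard_normal((20, 2)))
    V, _ = np.linalg.qr(rng.standard_normal((15, 2)))
    A = U @ np.diag([2.0, 1.0]) @ V.T

    X, Y = agd_factorize(A, d=4, eta=0.1, num_iters=1000)
    print("relative squared error:", relative_error(A, X, Y))
```

The only difference from simultaneous gradient descent is that the $\mathbf{Y}$ update uses the already-updated $\mathbf{X}$. The step size and iteration count in the usage example are illustrative choices for this small test matrix, not values prescribed by the analysis.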