The $L_{2}$-regularized loss of Deep Linear Networks (DLNs) with more than one hidden layer has multiple local minima, corresponding to matrices of different ranks. In tasks such as matrix completion, the goal is to converge to the local minimum of smallest rank that still fits the training data. While rank-underestimating minima can be avoided because they do not fit the data, gradient descent (GD) may get stuck at rank-overestimating minima. We show that with stochastic gradient descent (SGD), there is always a nonzero probability of jumping from a higher-rank minimum to a lower-rank one, whereas the probability of jumping back is zero. More precisely, we define a sequence of sets $B_{1}\subset B_{2}\subset\cdots\subset B_{R}$ such that $B_{r}$ contains all minima of rank $r$ or less (and none of higher rank), and each $B_{r}$ is absorbing for small enough ridge parameters $\lambda$ and learning rates $\eta$: SGD leaves $B_{r}$ with probability 0, and from any starting point SGD enters $B_{r}$ with nonzero probability.
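To make the setting concrete, the following is a minimal NumPy sketch of an $L_{2}$-regularized DLN loss for matrix completion and of a single SGD step on it. It is an illustration under stated assumptions, not the paper's exact formulation: the function names (`end_to_end`, `regularized_loss`, `sgd_step`), the mean-squared-error fit term over observed entries, and the choice of sampling one observed entry per step are all hypothetical choices made here.

```python
import numpy as np

def end_to_end(weights):
    """Return the product W_L ... W_1, where weights[0] is W_1."""
    A = weights[0]
    for W in weights[1:]:
        A = W @ A
    return A

def regularized_loss(weights, target, mask, lam):
    """Assumed loss: MSE over observed entries plus lam * sum of squared Frobenius norms."""
    A = end_to_end(weights)
    fit = np.sum(mask * (A - target) ** 2) / mask.sum()
    ridge = lam * sum(np.sum(W ** 2) for W in weights)
    return fit + ridge

def sgd_step(weights, target, mask, lam, eta, rng):
    """One SGD step using a single observed entry sampled uniformly (an assumption)."""
    observed = np.argwhere(mask)
    i, j = observed[rng.integers(len(observed))]
    A = end_to_end(weights)
    # Gradient of (A_ij - target_ij)^2 with respect to the end-to-end matrix A.
    G = np.zeros_like(A)
    G[i, j] = 2.0 * (A[i, j] - target[i, j])
    n_out, n_in = weights[-1].shape[0], weights[0].shape[1]
    new_weights = []
    for l, W in enumerate(weights):
        # Chain rule: dA/dW_l contributes (W_L...W_{l+1})^T G (W_{l-1}...W_1)^T.
        left = end_to_end(weights[l + 1:]) if l + 1 < len(weights) else np.eye(n_out)
        right = end_to_end(weights[:l]) if l > 0 else np.eye(n_in)
        grad = left.T @ G @ right.T + 2.0 * lam * W  # ridge term adds 2 * lam * W
        new_weights.append(W - eta * grad)
    return new_weights
```

Iterating `sgd_step` and monitoring the singular values of `end_to_end(weights)` is one way to observe the rank of the learned matrix over time; whether and when a jump to lower rank occurs depends on $\lambda$, $\eta$, and the sampled entries.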