The $L_{2}$-regularized loss of Deep Linear Networks (DLNs) with more than one hidden layer has multiple local minima, corresponding to matrices of different ranks. In tasks such as matrix completion, the goal is to converge to the local minimum with the smallest rank that still fits the training data. While rank-underestimating minima can easily be avoided since they do not fit the data, gradient descent might get stuck at rank-overestimating minima. We show that with SGD, there is always a nonzero probability of jumping from a higher-rank minimum to a lower-rank one, but the probability of jumping back is zero. More precisely, we define a sequence of sets $B_{1}\subset B_{2}\subset\cdots\subset B_{R}$ such that $B_{r}$ contains all minima of rank $r$ or less (and none of higher rank), and such that each $B_{r}$ is absorbing for small enough ridge parameters $\lambda$ and learning rates $\eta$: SGD has probability 0 of leaving $B_{r}$, and from any starting point there is a nonzero probability that SGD enters $B_{r}$.
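To make the setting concrete, the following is a minimal NumPy sketch of an $L_{2}$-regularized DLN loss for matrix completion and of one SGD step on a single observed entry. It is an illustration under stated assumptions, not the construction analyzed in the paper: the function names (`chain`, `regularized_loss`, `sgd_step`, `numerical_rank`), the factor $1/2$ on the squared error, the unnormalized ridge term $\lambda\sum_{k}\|W_{k}\|_{F}^{2}$, and the one-entry-per-step sampling scheme are all choices made here for illustration.

```python
import numpy as np

def chain(weights, dim):
    """Product W_k ... W_1 of the listed layers; identity of size dim if the list is empty."""
    product = np.eye(dim)
    for w in weights:
        product = w @ product
    return product

def regularized_loss(weights, target, mask, lam):
    """Half squared error on observed entries plus lam * sum of squared Frobenius norms."""
    end_to_end = chain(weights, weights[0].shape[1])
    residual = mask * (end_to_end - target)
    fit = 0.5 * np.sum(residual ** 2)
    ridge = lam * sum(np.sum(w ** 2) for w in weights)
    return fit + ridge

def sgd_step(weights, target, i, j, lam, eta):
    """One SGD step on the single observed entry (i, j), with ridge parameter lam (assumed scheme)."""
    d_in = weights[0].shape[1]
    err = chain(weights, d_in)[i, j] - target[i, j]
    # Gradient of 0.5 * err^2 with respect to the end-to-end matrix is err * e_i e_j^T.
    grad_out = np.zeros(target.shape)
    grad_out[i, j] = err
    updated = []
    for k, w in enumerate(weights):
        after = chain(weights[k + 1:], w.shape[0])   # layers applied after layer k
        before = chain(weights[:k], d_in)            # layers applied before layer k
        grad = after.T @ grad_out @ before.T + 2.0 * lam * w
        updated.append(w - eta * grad)
    return updated

def numerical_rank(weights, tol=1e-6):
    """Rank of the end-to-end matrix, counting singular values above tol."""
    return np.linalg.matrix_rank(chain(weights, weights[0].shape[1]), tol=tol)
```

In this sketch, a network whose weights are all zero is left unchanged by `sgd_step` whenever there are at least two layers, since every layer's gradient contains a product of the other (zero) layers. This is a simple instance of a region that the iteration cannot leave, in the spirit of the absorbing sets $B_{r}$ described above.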