When a deep ReLU network is initialized with small weights, gradient descent (GD) is at first dominated by the saddle at the origin in parameter space. We study the so-called escape directions along which GD leaves the origin, which play a role similar to that of the eigenvectors of the Hessian for strict saddles. We show that the optimal escape direction features a low-rank bias in its deeper layers: the first singular value of the $\ell$-th layer weight matrix is larger than any other singular value by a factor of at least $\ell^{\frac{1}{4}}$. We also prove a number of related results about these escape directions. We suggest that deep ReLU networks exhibit saddle-to-saddle dynamics, with GD visiting a sequence of saddles with increasing bottleneck rank (Jacot, 2023).
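As an illustration of the singular-value statement, the following minimal sketch checks whether a list of weight matrices satisfies $s_1(W_\ell) \geq \ell^{\frac{1}{4}} s_2(W_\ell)$ layer by layer. It assumes the bound is read as a ratio between the two largest singular values and that layers are indexed from $\ell = 1$. The function names `singular_value_gaps` and `satisfies_quarter_power_gap` are hypothetical, and the diagonal matrices are toy inputs, not escape directions computed from a network.

```python
import numpy as np


def singular_value_gaps(weights):
    """Return s_1 / s_2 (ratio of the two largest singular values) per matrix."""
    gaps = []
    for W in weights:
        # np.linalg.svd returns singular values sorted in descending order.
        s = np.linalg.svd(np.asarray(W, dtype=float), compute_uv=False)
        if len(s) < 2 or s[1] == 0.0:
            gaps.append(float("inf"))  # rank one (or a single singular value)
        else:
            gaps.append(float(s[0] / s[1]))
    return gaps


def satisfies_quarter_power_gap(weights):
    """Check s_1(W_l) >= l**(1/4) * s_2(W_l) for layers l = 1, 2, ... (assumed indexing)."""
    return [
        bool(gap >= layer ** 0.25)
        for layer, gap in enumerate(singular_value_gaps(weights), start=1)
    ]


# Toy example: diagonal matrices, whose singular values are the diagonal entries.
toy_weights = [
    np.diag([2.0, 1.0, 0.5]),  # layer 1: ratio 2.0, threshold 1
    np.diag([1.1, 1.0, 0.5]),  # layer 2: ratio 1.1, threshold 2**0.25 (about 1.19)
    np.diag([3.0, 1.0, 1.0]),  # layer 3: ratio 3.0, threshold 3**0.25 (about 1.32)
]
result = satisfies_quarter_power_gap(toy_weights)
```

Since the singular values are sorted, comparing the first against the second is enough to cover "any other singular value".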