Training a neural network requires choosing a suitable learning rate, which involves a trade-off between the speed and the effectiveness of convergence. While there has been considerable theoretical and empirical analysis of how large the learning rate can be, most prior work focuses only on late-stage training. In this work, we introduce the maximal initial learning rate $\eta^{\ast}$: the largest learning rate at which a randomly initialized neural network can successfully begin training and achieve (at least) a given threshold accuracy. Using a simple approach to estimate $\eta^{\ast}$, we observe that in constant-width fully-connected ReLU networks, $\eta^{\ast}$ behaves differently from the maximum learning rate later in training. Specifically, we find that $\eta^{\ast}$ is well predicted as a power of depth $\times$ width, provided that (i) the width of the network is sufficiently large compared to the depth, and (ii) the input layer is trained at a relatively small learning rate. We further analyze the relationship between $\eta^{\ast}$ and the sharpness $\lambda_{1}$ of the network at initialization, and find that the two are closely, though not inversely, related. We formally prove bounds for $\lambda_{1}$ in terms of depth $\times$ width that align with our empirical results.
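The abstract does not spell out how $\eta^{\ast}$ is estimated or how the power law is fitted, so the following is only a minimal sketch under stated assumptions, not the authors' procedure. It assumes a hypothetical callable `reaches_threshold(lr)` that trains a freshly initialized network at learning rate `lr` and reports whether it reached the threshold accuracy, and it assumes that training succeeds for every rate below some cutoff and fails above it. Under those assumptions, $\eta^{\ast}$ can be located by bisection on a log scale, and a relation of the form $\eta^{\ast} \approx c \cdot (\text{depth} \times \text{width})^{\alpha}$ can be fitted by least squares in log-log space. The function and parameter names are illustrative.

```python
import math
from typing import Callable, Sequence, Tuple


def estimate_max_initial_lr(
    reaches_threshold: Callable[[float], bool],
    lr_low: float = 1e-6,
    lr_high: float = 10.0,
    steps: int = 30,
) -> float:
    """Bisect on a log scale for the largest learning rate that still trains.

    Assumption (not from the source): success is monotone in the learning
    rate, i.e. every rate below some cutoff trains and every rate above fails.
    Real training runs are noisy, so treat the result as a rough estimate.
    """
    if not reaches_threshold(lr_low):
        raise ValueError("lr_low must be a learning rate that trains successfully")
    if reaches_threshold(lr_high):
        return lr_high  # the cutoff lies at or above the search range
    log_low, log_high = math.log(lr_low), math.log(lr_high)
    for _ in range(steps):
        log_mid = 0.5 * (log_low + log_high)
        if reaches_threshold(math.exp(log_mid)):
            log_low = log_mid   # midpoint still trains: cutoff is higher
        else:
            log_high = log_mid  # midpoint fails: cutoff is lower
    return math.exp(log_low)


def fit_power_law(sizes: Sequence[float], lrs: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of lr = c * size**alpha in log-log space.

    `sizes` stands for depth * width; returns the pair (c, alpha).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in lrs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    c = math.exp(mean_y - alpha * mean_x)
    return c, alpha
```

In practice `reaches_threshold` would wrap a full training run, so each bisection step is expensive; the log-scale search keeps the number of runs small while covering several orders of magnitude of learning rate.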