Phase diagram of early training dynamics in deep neural networks: effect of the learning rate, depth, and width

from arxiv, Accepted at NeurIPS 2023 (camera-ready version): Additional results added for cross-entropy loss and effect on network output at initialization; 10+32 pages, 8+35 figures

We systematically analyze optimization dynamics in deep neural networks (DNNs) trained with stochastic gradient descent (SGD) and study the effect of learning rate $\eta$, depth $d$, and width $w$ of the neural network. By analyzing the maximum eigenvalue $\lambda^H_t$ of the Hessian of the loss, which is a measure of sharpness of the loss landscape, we find that the dynamics can show four distinct regimes: (i) an early time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and (iv) a late time ``edge of stability" regime. The early and intermediate regimes (i) and (ii) exhibit a rich phase diagram depending on $\eta \equiv c / \lambda_0^H $, $d$, and $w$. We identify several critical values of $c$, which separate qualitatively distinct phenomena in the early time dynamics of training loss and sharpness. Notably, we discover the opening up of a ``sharpness reduction" phase, where sharpness decreases at early times, as $d$ and $1/w$ are increased.

翻译：我们系统分析了使用随机梯度下降（SGD）训练的深度神经网络（DNN）中的优化动力学，研究了学习率$\eta$、网络深度$d$和宽度$w$的影响。通过分析损失函数Hessian矩阵的最大特征值$\lambda^H_t$（损失景观尖锐度的度量），我们发现动力学可呈现四种不同区域：（i）早期瞬态区域，（ii）中间饱和区域，（iii）渐进尖锐化区域，以及（iv）后期"稳定性边界"区域。早期和中间区域（i）和（ii）表现出丰富的相图，其行为取决于$\eta \equiv c / \lambda_0^H$、$d$和$w$。我们识别出$c$的几个临界值，这些临界值区分了训练损失和尖锐度早期动力学中性质截然不同的现象。值得注意的是，我们发现随着$d$和$1/w$的增大，会出现一个"尖锐度降低"相，即尖锐度在早期阶段下降。

相关内容

Neural Networks

关注 1654

神经网络（Neural Networks）是世界上三个最古老的神经建模学会的档案期刊:国际神经网络学会(INNS)、欧洲神经网络学会(ENNS)和日本神经网络学会(JNNS)。神经网络提供了一个论坛，以发展和培育一个国际社会的学者和实践者感兴趣的所有方面的神经网络和相关方法的计算智能。神经网络欢迎高质量论文的提交，有助于全面的神经网络研究，从行为和大脑建模，学习算法，通过数学和计算分析，系统的工程和技术应用，大量使用神经网络的概念和技术。这一独特而广泛的范围促进了生物和技术研究之间的思想交流，并有助于促进对生物启发的计算智能感兴趣的跨学科社区的发展。因此，神经网络编委会代表的专家领域包括心理学，神经生物学，计算机科学，工程，数学，物理。该杂志发表文章、信件和评论以及给编辑的信件、社论、时事、软件调查和专利信息。文章发表在五个部分之一:认知科学，神经科学，学习系统，数学和计算分析、工程和应用。官网地址：http://dblp.uni-trier.de/db/journals/nn/

《生成式模型: 变分自编码器与扩散模型》，75页ppt，Google DeepMind科学家Ruiqi Gao

专知会员服务

66+阅读 · 2023年6月10日

Nat. Biotechnol. | 机器学习为生物库驱动的药物发现提供动力

专知会员服务

11+阅读 · 2022年9月12日

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

【ACL2020】多模态信息抽取，365页ppt

专知会员服务

151+阅读 · 2020年7月6日