Stochastic gradient descent (SGD) is central to deep learning, yet the dynamical origin of its preference for flatter, more generalizable solutions remains unclear. Here, by analyzing SGD learning dynamics, we identify a nonequilibrium mechanism that governs solution selection during training. Numerical experiments reveal a transient exploratory phase in which SGD trajectories repeatedly escape sharp valleys and migrate toward flatter regions of the loss landscape before becoming confined to a final basin. Using a tractable physical model, we show that SGD noise reshapes the loss landscape into an effective potential that preferentially stabilizes flat solutions. We further uncover a transient freezing mechanism: as training progresses, the flattening landscape suppresses transitions between competing valleys. Stronger SGD noise delays this freezing transition, prolonging the exploratory phase and thereby increasing the probability of convergence to flatter minima. Together, these results provide a unified physical framework connecting learning dynamics, loss-landscape geometry, and generalization, and suggest guiding principles for the design of more effective optimization algorithms.
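As a rough illustration of the noise-driven preference for flat solutions described above, the following minimal sketch simulates noisy gradient descent in a one-dimensional loss with two equally deep valleys, one sharp and one flat. It is not the tractable physical model of this work. The piecewise-quadratic loss, the assumption that gradient noise scales with the local curvature, and all names and parameter values (`K_SHARP`, `K_FLAT`, `sigma`, `lr`, `fraction_in_flat_valley`) are hypothetical choices made only for this example. The sketch does not attempt to reproduce the transient freezing mechanism.

```python
import random

# Hypothetical toy landscape: two parabolic valleys of equal depth (loss = 0 at both minima).
K_SHARP, K_FLAT = 10.0, 1.0   # curvatures of the sharp and flat valleys
C_SHARP, C_FLAT = -1.0, 1.0   # positions of the two minima
# Ridge between the valleys: the point where the two parabolas intersect.
RIDGE = (C_SHARP * K_SHARP**0.5 + C_FLAT * K_FLAT**0.5) / (K_SHARP**0.5 + K_FLAT**0.5)


def noisy_grad(x, sigma, rng):
    """Gradient of the local parabola plus noise.

    Assumption: the noise amplitude is proportional to the local curvature,
    as if each minibatch randomly shifted the position of the minimum.
    """
    k, c = (K_SHARP, C_SHARP) if x < RIDGE else (K_FLAT, C_FLAT)
    return k * (x - c + sigma * rng.gauss(0.0, 1.0))


def fraction_in_flat_valley(sigma, lr=0.05, steps=1000, walkers=200, seed=0):
    """Start every walker at the sharp minimum; return the fraction ending in the flat valley."""
    rng = random.Random(seed)
    in_flat = 0
    for _ in range(walkers):
        x = C_SHARP
        for _ in range(steps):
            x -= lr * noisy_grad(x, sigma, rng)
        in_flat += x >= RIDGE
    return in_flat / walkers


if __name__ == "__main__":
    for sigma in (0.0, 0.5):
        print(f"sigma={sigma}: fraction in flat valley = {fraction_in_flat_valley(sigma)}")
```

Under these assumptions, noise-free descent stays at the sharp minimum where it started, whereas with noise the walkers leave the sharp valley and settle in the flat one, even though both minima have the same loss. Because the assumed noise is stronger where the curvature is larger, escapes from the sharp valley are frequent while escapes from the flat valley are rare.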