Stochastic gradient descent (SGD), a widely used algorithm in deep-learning neural networks, has attracted continuing study of the theoretical principles behind its success. A recent work reports an anomalous (inverse) relation between the variance of neural weights and the flatness of the loss landscape under SGD [Feng & Tu, PNAS 118, 0027 (2021)]. To investigate this apparent violation of a statistical physics principle, the properties of SGD near fixed points are analysed via a dynamic decomposition method. Our approach recovers the true "energy" function under which the universal Boltzmann distribution holds. This energy function differs from the cost function in general, which resolves the paradox raised by the anomaly. The study bridges the gap between classical statistical mechanics and the emerging discipline of artificial intelligence, with potential for better algorithms for the latter.