We present the results of numerical experiments on neural networks trained with stochastic gradient-based optimization with adaptive momentum. This widely used optimization method has proven convergence guarantees and practical efficiency, but in long-run training it becomes numerically unstable. We show that numerical artifacts are observable not only in large-scale models but also in shallow narrow networks, where they ultimately lead to divergence. We support this claim with experiments on more than 1600 neural networks trained for 50000 epochs. Local observations reveal the same behavior of network parameters in both stable and unstable training segments. Geometrically, the parameters trace double twisted spirals in parameter space, caused by numerical perturbations alternating with subsequent relaxation oscillations in the values of the first and second momentum.
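For reference, the quantities referred to as the first and second momentum are those maintained by Adam-style adaptive-momentum methods; the following standard update rule is given as an assumed representative of the class, since the abstract does not name a specific optimizer:
\begin{align}
  m_t &= \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t, &&\text{(1st momentum: running mean of gradients)}\\
  v_t &= \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^{2}, &&\text{(2nd momentum: running mean of squared gradients)}\\
  \hat{m}_t &= \frac{m_t}{1 - \beta_1^{\,t}}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}}, &&\text{(bias correction)}\\
  \theta_t &= \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, &&\text{(parameter update)}
\end{align}
where $g_t$ is the stochastic gradient at step $t$, $\theta_t$ are the network parameters, $\alpha$ is the learning rate, and $\beta_1, \beta_2, \epsilon$ are the usual hyperparameters.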