Neural networks are usually trained with variants of gradient-descent-based optimization algorithms such as stochastic gradient descent (SGD) or the Adam optimizer. Recent theoretical work shows that not all critical points (points where the gradient of the loss is zero) of two-layer ReLU networks with the square loss are local minima. Nevertheless, in this work we explore an algorithm for training two-layer neural networks with ReLU-like activations and the square loss that alternately finds the critical points of the loss function analytically for one layer while keeping the other layer and the neuron activation pattern fixed. Experiments indicate that this simple algorithm can find deeper optima than SGD or the Adam optimizer, obtaining significantly smaller training loss values on four of the five real datasets evaluated. Moreover, the method is faster than the gradient descent methods and has virtually no tuning parameters.
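The alternating scheme described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation; it only shows why each sub-problem has a closed-form solution. The sketch assumes a bias-free network f(x) = sum_j a_j * relu(w_j^T x) with a plain ReLU, and all function names (`solve_output_layer`, `solve_hidden_layer`, `train`) are hypothetical. With the hidden weights fixed, the loss is quadratic in the output weights `a`. With `a` and the 0/1 activation pattern fixed, the prediction is linear in the hidden weights `W`, so both steps reduce to linear least squares. The backtracking guard in `train` is an added assumption: it is there because the activation pattern can change after a hidden-layer update, so the true loss is not guaranteed to drop.

```python
import numpy as np

def forward(X, W, a):
    """Two-layer ReLU network without biases: relu(X W) a."""
    return np.maximum(X @ W, 0.0) @ a

def mse(X, y, W, a):
    """Mean squared error of the network on (X, y)."""
    return float(np.mean((forward(X, W, a) - y) ** 2))

def solve_output_layer(X, y, W):
    """Least-squares output weights with the hidden layer fixed."""
    H = np.maximum(X @ W, 0.0)                      # (n, h) hidden activations
    return np.linalg.lstsq(H, y, rcond=None)[0]     # (h,)

def solve_hidden_layer(X, y, W, a):
    """Least-squares hidden weights with `a` and the activation pattern fixed."""
    n, d = X.shape
    h = W.shape[1]
    D = (X @ W > 0).astype(X.dtype)                 # (n, h) fixed 0/1 pattern
    # Prediction_i = sum_j a_j * D_ij * (x_i . w_j), which is linear in W.
    # Column j*d + k of Phi holds a_j * D_ij * X_ik.
    Phi = ((D * a)[:, :, None] * X[:, None, :]).reshape(n, h * d)
    w = np.linalg.lstsq(Phi, y, rcond=None)[0]      # (h * d,)
    return w.reshape(h, d).T                        # back to shape (d, h)

def train(X, y, hidden=8, iters=20, seed=0):
    """Alternate the two closed-form solves; returns W, a and the loss history."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], hidden))
    a = rng.normal(size=hidden)
    losses = [mse(X, y, W, a)]
    for _ in range(iters):
        # Output-layer step: keep the solution only if it does not hurt.
        a_new = solve_output_layer(X, y, W)
        if mse(X, y, W, a_new) <= losses[-1]:
            a = a_new
        # Hidden-layer step: the pattern may change after the update, so
        # backtrack toward the old weights until the true loss decreases
        # (an assumption of this sketch, not taken from the paper).
        W_new = solve_hidden_layer(X, y, W, a)
        base = mse(X, y, W, a)
        for t in (1.0, 0.5, 0.25, 0.125):
            W_try = W + t * (W_new - W)
            if mse(X, y, W_try, a) < base:
                W = W_try
                break
        losses.append(mse(X, y, W, a))
    return W, a, losses
```

Calling `train(X, y)` on a feature matrix `X` of shape `(n, d)` and a target vector `y` of shape `(n,)` returns the fitted weights and a loss history that never increases, because each update is accepted only when it does not raise the training loss. The only settings in this sketch are the hidden width and the iteration count, which is in the spirit of the abstract's remark that the method needs virtually no tuning.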