Neural networks are typically optimized with variants of stochastic gradient descent. Under a squared loss, however, the optimal weights of a linear last layer are known in closed form. We propose to leverage this during optimization by treating the last layer as a function of the backbone parameters and optimizing solely for those parameters. We show that this is equivalent to alternating between gradient descent steps on the backbone and closed-form updates to the last layer. We adapt the method to the stochastic mini-batch setting by trading off the loss on the current batch against the information accumulated from previous batches. Further, we prove that, in the Neural Tangent Kernel regime, this method is guaranteed to converge to an optimal solution. Finally, we demonstrate the effectiveness of our approach compared with standard SGD on a squared loss in several supervised regression and classification tasks, including Fourier Neural Operators and Instrumental Variable Regression.
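To make the core idea concrete, the following is a minimal sketch (not the paper's implementation) of the scheme the abstract describes: the linear last layer is set to its closed-form least-squares optimum as a function of the backbone features, and gradient steps are taken on the backbone parameters only. The toy feature map, the names `backbone` and `solve_last_layer`, and the small ridge term added for numerical stability are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def backbone(params, x):
    # Toy one-hidden-layer feature map; stands in for an arbitrary backbone.
    return jnp.tanh(x @ params["W"] + params["b"])

def solve_last_layer(phi, y, ridge=1e-6):
    # Closed-form minimizer of the squared loss ||phi A - y||^2 (plus a small
    # ridge term, assumed here for numerical stability).
    gram = phi.T @ phi + ridge * jnp.eye(phi.shape[1])
    return jnp.linalg.solve(gram, phi.T @ y)

def loss(params, x, y):
    # The last layer A is treated as a function of the backbone parameters:
    # it is recomputed in closed form inside the loss, so only the backbone
    # parameters are free optimization variables.
    phi = backbone(params, x)
    A = solve_last_layer(phi, y)
    return jnp.mean((phi @ A - y) ** 2)

@jax.jit
def step(params, x, y, lr=1e-2):
    # Gradient descent step on the backbone parameters only.
    grads = jax.grad(loss)(params, x, y)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

# Tiny synthetic regression problem to exercise the sketch.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (128, 4))
y = jnp.sin(x.sum(axis=1, keepdims=True))
params = {"W": 0.1 * jax.random.normal(key, (4, 32)), "b": jnp.zeros(32)}
for _ in range(100):
    params = step(params, x, y)
```

Because the last layer is at its minimizer for every backbone configuration, differentiating through the closed-form solve corresponds to the alternating view in the abstract: a gradient step on the backbone followed by a closed-form update of the last layer.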