First-order methods, such as gradient descent (GD) and stochastic gradient descent (SGD), have proven effective in training neural networks. In the over-parameterized regime, a line of work has shown that randomly initialized (stochastic) gradient descent converges to a globally optimal solution at a linear rate for the quadratic loss. However, the learning rate of GD for training two-layer neural networks exhibits poor dependence on the sample size and the Gram matrix, slowing the training process. In this paper, we show that for $L^2$ regression problems the learning rate can be improved from $\mathcal{O}(\lambda_0/n^2)$ to $\mathcal{O}(1/\|\bm{H}^{\infty}\|_2)$, which implies that GD in fact enjoys a faster convergence rate. Furthermore, we extend the analysis to GD for training two-layer Physics-Informed Neural Networks (PINNs), obtaining a similar improvement in the learning rate. Although the improved learning rate depends only mildly on the Gram matrix, the eigenvalues of the Gram matrix are unknown in practice, so the learning rate must still be set conservatively small. More importantly, the convergence rate is tied to the smallest eigenvalue of the Gram matrix, which can lead to slow convergence. In this work, we provide a convergence analysis of natural gradient descent (NGD) for training two-layer PINNs, demonstrating that the learning rate can be $\mathcal{O}(1)$ and that, at this rate, the convergence rate is independent of the Gram matrix.
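To make the learning-rate comparison concrete, the following is a minimal NumPy sketch, not code from the paper. It assumes unit-norm inputs and the standard ReLU neural-tangent-kernel Gram matrix $\bm{H}^{\infty}_{ij} = \bm{x}_i^\top \bm{x}_j\,(\pi - \arccos(\bm{x}_i^\top \bm{x}_j))/(2\pi)$ from the over-parameterization literature; the variable names (`lr_old`, `lr_new`) are illustrative. It computes both step sizes, $\lambda_0/n^2$ and $1/\|\bm{H}^{\infty}\|_2$, for a small random dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 5
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs, as assumed

# ReLU NTK Gram matrix: H_ij = <x_i, x_j> * (pi - arccos(<x_i, x_j>)) / (2*pi)
G = np.clip(X @ X.T, -1.0, 1.0)  # clip guards arccos against rounding
H_inf = G * (np.pi - np.arccos(G)) / (2 * np.pi)

eigs = np.linalg.eigvalsh(H_inf)     # ascending eigenvalues of the symmetric H
lam0, lam_max = eigs[0], eigs[-1]    # smallest eigenvalue and spectral norm

lr_old = lam0 / n**2          # classical O(lambda_0 / n^2) step size
lr_new = 1.0 / lam_max        # improved O(1 / ||H_inf||_2) step size
```

For unit-norm inputs the diagonal of $\bm{H}^{\infty}$ is $1/2$, so $\|\bm{H}^{\infty}\|_2 \le n/2$ and hence $1/\|\bm{H}^{\infty}\|_2 \ge 2/n$, while $\lambda_0/n^2 \le 1/(2n)$: the improved step size is strictly larger.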