This work studies the global convergence and implicit bias of the Gauss-Newton (GN) method when optimizing over-parameterized one-hidden-layer networks in the mean-field regime. We first establish a global convergence result for GN in the continuous-time limit, which exhibits a faster convergence rate than gradient descent (GD) due to improved conditioning. We then perform an empirical study on a synthetic regression task to investigate the implicit bias of GN. Although GN consistently finds a global optimum faster than GD, the learned model generalizes well on test data when training starts from random initial weights with small variance and uses a small step size to slow down convergence. Specifically, our study shows that this setting produces a hidden learning phenomenon, in which the dynamics recover features with good generalization properties even though the model has sub-optimal training and test performance due to an under-optimized linear layer. The study thus exhibits a trade-off between the convergence speed of GN and the generalization ability of the learned solution.
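To make the comparison between GN and GD concrete, the following is a minimal NumPy sketch of the two update directions for a one-hidden-layer network with mean-field (1/M) output scaling. It is an illustration under stated assumptions, not the setup used in this work: the tanh activation, the squared loss, the damping term, the toy data, and all function names (`predict`, `jacobian`, `gn_direction`, `gd_direction`) are hypothetical choices made here.

```python
import numpy as np

def unpack(theta, M, d):
    # theta stores the output weights a (M,) followed by the hidden weights W (M, d).
    return theta[:M], theta[M:].reshape(M, d)

def predict(theta, X, M):
    # Mean-field scaling (assumed): average over the M hidden units instead of summing.
    a, W = unpack(theta, M, X.shape[1])
    return np.tanh(X @ W.T) @ a / M

def jacobian(theta, X, M):
    # Jacobian of the N network outputs with respect to all parameters, shape (N, M + M*d).
    a, W = unpack(theta, M, X.shape[1])
    H = np.tanh(X @ W.T)                                      # (N, M) hidden activations
    J_a = H / M                                               # d f / d a_j
    J_W = ((1.0 - H**2) * a / M)[:, :, None] * X[:, None, :]  # d f / d W_jk, shape (N, M, d)
    return np.hstack([J_a, J_W.reshape(len(X), -1)])

def gd_direction(theta, X, y, M):
    # Gradient of 0.5 * ||f(X) - y||^2, i.e. J^T r.
    r = predict(theta, X, M) - y
    return jacobian(theta, X, M).T @ r

def gn_direction(theta, X, y, M, damping=1e-6):
    # Gauss-Newton direction J^T (J J^T + damping * I)^{-1} r.
    # With more parameters than samples, the N x N system is the cheaper one to solve;
    # the small damping term is an assumption added for numerical stability.
    r = predict(theta, X, M) - y
    J = jacobian(theta, X, M)
    return J.T @ np.linalg.solve(J @ J.T + damping * np.eye(len(y)), r)

# Toy regression problem (hypothetical data and sizes).
rng = np.random.default_rng(0)
N, d, M = 20, 3, 50
X = rng.normal(size=(N, d))
y = np.sin(X[:, 0])
theta = rng.normal(size=M + M * d)   # the initialization variance is a free choice here

step = 0.1                           # small step size, also a free choice
theta_gn = theta - step * gn_direction(theta, X, y, M)
theta_gd = theta - step * gd_direction(theta, X, y, M)
```

In this sketch the GN direction rescales the residual by the inverse of `J @ J.T`, which is one way to picture the improved conditioning mentioned above; the initialization variance and `step` are the two knobs the empirical study varies.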