A fundamental problem in machine learning is understanding how early stopping affects the learned parameters and the generalization ability of the model. Even for linear models, this effect is not fully understood for arbitrary learning rates and data. In this paper, we analyze the dynamics of discrete full-batch gradient descent for linear regression. Under minimal assumptions, we characterize the trajectory of the parameters and the expected excess risk. Using this characterization, we show that when training with a learning rate schedule $\eta_k$ and a finite time horizon $T$, the early-stopped solution $\beta_T$ is equivalent to the minimum-norm solution of a generalized ridge-regularized problem. We also prove that early stopping is beneficial for generic data with arbitrary spectrum and for a wide variety of learning rate schedules. Finally, we provide an estimate of the optimal stopping time and empirically demonstrate its accuracy.
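The equivalence between the early-stopped solution $\beta_T$ and a generalized ridge problem can be checked numerically. The sketch below is one standard way to see it and is not necessarily the paper's construction. It rests on several assumptions that the abstract does not state: the loss is $\frac{1}{2n}\|y - X\beta\|^2$, gradient descent starts from $\beta_0 = 0$, $X$ has full column rank (so the ridge minimizer is unique and trivially minimum-norm), and every step satisfies $\eta_k \mu_{\max} < 1$, where $\mu_{\max}$ is the largest eigenvalue of $X^\top X / n$. The function names are hypothetical. Along each right singular vector of $X$, $T$ steps shrink the error by $p_i = \prod_k (1 - \eta_k \mu_i)$, and the penalty $\lambda_i = \mu_i p_i / (1 - p_i)$ reproduces the same shrinkage.

```python
import numpy as np


def gd_final_iterate(X, y, etas):
    """Full-batch GD on (1/2n)||y - X b||^2 from b = 0; returns the last iterate."""
    n, d = X.shape
    beta = np.zeros(d)
    for eta in etas:
        beta = beta - eta * (X.T @ (X @ beta - y)) / n
    return beta


def matching_ridge_penalties(X, etas):
    """Per-direction penalties lambda_i that reproduce GD after len(etas) steps.

    Assumes X has full column rank and eta_k * mu_max < 1 for every step.
    """
    n = X.shape[0]
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    mu = s**2 / n  # eigenvalues of X^T X / n
    # p_i = prod_k (1 - eta_k * mu_i): error fraction left in direction i
    p = np.prod(1.0 - np.outer(np.asarray(etas), mu), axis=0)
    lam = mu * p / (1.0 - p)
    return lam, Vt


def generalized_ridge(X, y, lam, Vt):
    """Minimize (1/2n)||y - X b||^2 + (1/2) b^T V diag(lam) V^T b."""
    n = X.shape[0]
    A = X.T @ X / n + Vt.T @ np.diag(lam) @ Vt
    return np.linalg.solve(A, X.T @ y / n)


# Small synthetic demo (the data and the schedule are illustrative only).
rng = np.random.default_rng(0)
n, d, T = 50, 5, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
mu_max = np.linalg.svd(X, compute_uv=False)[0] ** 2 / n
etas = [0.5 / mu_max / np.sqrt(k + 1) for k in range(T)]  # decaying schedule

beta_gd = gd_final_iterate(X, y, etas)
lam, Vt = matching_ridge_penalties(X, etas)
beta_ridge = generalized_ridge(X, y, lam, Vt)
print("max abs difference:", np.max(np.abs(beta_gd - beta_ridge)))
```

Under these assumptions the two solutions should agree up to floating-point error. Directions with small $\mu_i$ receive large penalties, which matches the intuition that early stopping mostly suppresses poorly determined directions.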