Modern machine learning paradigms, such as deep learning, operate in or close to the interpolation regime, wherein the number of model parameters is much larger than the number of data samples. In this work, we propose a regularity condition within the interpolation regime that endows the stochastic gradient method with the same worst-case iteration complexity as the deterministic gradient method, while using only a single sampled gradient (or a minibatch) in each iteration. In contrast, all existing guarantees require the stochastic gradient method to take small steps, which results in a much slower linear rate of convergence. Finally, we demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.
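As a rough illustration of the interpolation setting described above, the sketch below runs the stochastic gradient method with one sampled gradient per iteration and a constant step size on an overparameterized least-squares problem, next to the deterministic gradient method. It is not the regularity condition or the analysis proposed in this work. The linear model, the problem sizes, the step-size choices, and all function names (`make_problem`, `loss`, `sgd`, `gd`) are assumptions made purely for illustration.

```python
import numpy as np


def make_problem(n=20, d=100, seed=0):
    """Random least-squares problem with d >> n, so some w fits every sample exactly."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))  # one row per data sample
    b = rng.standard_normal(n)       # targets
    return A, b


def loss(A, b, w):
    """Average squared residual, f(w) = (1 / 2n) * sum_i (a_i^T w - b_i)^2."""
    r = A @ w - b
    return 0.5 * np.mean(r ** 2)


def sgd(A, b, steps=5000, seed=1):
    """Stochastic gradient method: one uniformly sampled gradient per iteration."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    # Constant step size (illustrative choice): inverse of the largest
    # per-sample smoothness constant, max_i ||a_i||^2.
    eta = 1.0 / np.max(np.sum(A ** 2, axis=1))
    for _ in range(steps):
        i = rng.integers(n)
        w -= eta * (A[i] @ w - b[i]) * A[i]
    return w


def gd(A, b, steps=5000):
    """Deterministic gradient method using the full gradient in each iteration."""
    n, d = A.shape
    w = np.zeros(d)
    # Smoothness constant of f: largest eigenvalue of A^T A / n.
    L = np.linalg.norm(A, 2) ** 2 / n
    for _ in range(steps):
        w -= (1.0 / L) * (A.T @ (A @ w - b)) / n
    return w


if __name__ == "__main__":
    A, b = make_problem()
    print("initial loss:", loss(A, b, np.zeros(A.shape[1])))
    print("SGD loss:    ", loss(A, b, sgd(A, b)))
    print("GD loss:     ", loss(A, b, gd(A, b)))
```

Because an interpolating solution exists in this toy problem, every sampled gradient vanishes at that solution, so the constant-step stochastic iterates are not held up by a noise floor. Each stochastic iteration touches one sample while each deterministic iteration touches all of them, which is the cost gap the comparison in the abstract is about.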