The success of SGD in deep learning has been ascribed by prior works to the implicit bias induced by a high learning rate or small batch size ("SGD noise"). While prior works focused on offline learning (i.e., multiple-epoch training), we study the impact of SGD noise on online (i.e., single-epoch) learning. Through an extensive empirical analysis of image and language data, we demonstrate that a large learning rate and a small batch size do not confer any implicit bias advantages in online learning. In contrast to offline learning, the benefits of SGD noise in online learning are strictly computational, facilitating larger or more cost-effective gradient steps. Our work suggests that SGD in the online regime can be construed as taking noisy steps along the "golden path" of the noiseless gradient flow algorithm. We provide evidence to support this hypothesis by conducting experiments that reduce SGD noise during training and by measuring the pointwise functional distance between models trained with varying SGD noise levels, but at equivalent loss values. Our findings challenge the prevailing understanding of SGD and offer novel insights into its role in online learning.