With limited high-quality data and growing compute, multi-epoch training is regaining importance across sub-areas of deep learning. Adam(W), versions of which are the go-to optimizers for many tasks such as next-token prediction, has two momentum hyperparameters $(β_1, β_2)$ controlling memory and one very important hyperparameter, the batch size, controlling (in particular) the amount of mini-batch noise. We introduce a theoretical framework to understand how mini-batch noise influences the implicit bias of memory in Adam (depending on $β_1$, $β_2$) towards sharper or flatter regions of the loss landscape, which is commonly observed to correlate with the generalization gap in multi-epoch training. We find that for large batch sizes, higher $β_2$ increases the magnitude of anti-regularization by memory (hurting generalization), but as the batch size becomes smaller, the dependence of (anti-)regularization on $β_2$ is reversed. A similar monotonicity shift (in the opposite direction) happens in $β_1$. In particular, the common "default" pair $(β_1, β_2) = (0.9, 0.999)$ is a good choice if batches are small; for larger batches, in many settings moving $β_1$ closer to $β_2$ is much better in terms of validation accuracy in multi-epoch training. Moreover, our theoretical derivations connect the scale of the batch size at which the shift happens to the scale of the critical batch size. We illustrate this effect in experiments with small-scale data in the about-to-overfit regime.
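To fix notation for the hyperparameters discussed above, the following is a minimal sketch of one Adam update step (a hypothetical standalone function, not the paper's framework): $β_1$ and $β_2$ set the exponential-moving-average horizons of the gradient and squared-gradient estimates, so each beta gives a memory of roughly $1/(1-β)$ past mini-batch gradients.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step. beta1/beta2 control how much "memory" of past
    mini-batch gradients the moment estimates retain (~1/(1-beta) steps)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize f(theta) = theta^2 from theta = 1
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 101):
    grad = 2.0 * theta                        # exact gradient of theta^2
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
```

With the default $(0.9, 0.999)$, the second-moment memory ($\approx 1000$ steps) is far longer than the first ($\approx 10$ steps); moving $β_1$ closer to $β_2$, as suggested above for larger batches, equalizes these horizons.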