With limited high-quality data and growing compute, multi-epoch training is regaining importance across sub-areas of deep learning. Adam(W), variants of which are the go-to optimizers for many tasks such as next-token prediction, has two momentum hyperparameters $(β_1, β_2)$ controlling memory and one very important hyperparameter, the batch size, controlling (in particular) the amount of mini-batch noise. We introduce a theoretical framework to understand how mini-batch noise influences the implicit bias of memory in Adam (depending on $β_1$, $β_2$) towards sharper or flatter regions of the loss landscape, a property commonly observed to correlate with the generalization gap in multi-epoch training. We find that for large batch sizes, higher $β_2$ increases the magnitude of anti-regularization by memory (hurting generalization), but as the batch size becomes smaller, the dependence of (anti-)regularization on $β_2$ is reversed. A similar monotonicity shift (in the opposite direction) happens in $β_1$. In particular, the common "default" pair $(β_1, β_2) = (0.9, 0.999)$ is a good choice if batches are small; for larger batches, in many settings moving $β_1$ closer to $β_2$ gives much better validation accuracy in multi-epoch training. Moreover, our theoretical derivations connect the scale of the batch size at which the shift happens to the scale of the critical batch size. We illustrate this effect in experiments with small-scale data in the about-to-overfit regime.
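For readers who want the role of $(β_1, β_2)$ made concrete, the following is a minimal sketch of the textbook Adam update for a single scalar parameter. It is not the theoretical framework of this work; weight decay is omitted, and the helper names `adam_step` and `effective_memory` are hypothetical, chosen only for illustration. The two exponential moving averages are the "memory": $β_1$ sets how long past gradients persist in the first moment, and $β_2$ sets how long past squared gradients persist in the second moment.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update for a scalar parameter (no weight decay).

    t is the 1-based step count, used for bias correction.
    """
    # First-moment memory: exponential moving average of gradients.
    m = beta1 * m + (1.0 - beta1) * grad
    # Second-moment memory: exponential moving average of squared gradients.
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def effective_memory(beta):
    """Rough averaging horizon (in steps) of an EMA with decay rate beta."""
    return 1.0 / (1.0 - beta)
```

Under this rule of thumb, the default pair $(0.9, 0.999)$ corresponds to horizons of roughly 10 and 1000 steps, so moving $β_1$ closer to $β_2$ lengthens the first-moment memory towards that of the second moment. In this sketch, the gradient passed to `adam_step` would be a mini-batch estimate whose noise shrinks as the batch size grows.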