The scaling of the optimal AdamW weight decay hyperparameter with model and dataset size is critical as we seek to build larger models, but is poorly understood. We show that weights learned by AdamW can be understood as an exponential moving average (EMA) of recent updates. This gives critical insights into how to set the weight decay in AdamW, and how the weight decay should scale with model and dataset size. In particular, the key hyperparameter for an exponential moving average is the EMA timescale. Intuitively, the EMA timescale can be understood as the number of recent iterations the EMA averages over. We find that the optimal timescale, measured in epochs, is roughly constant as we change model and dataset size. Moreover, given a learning rate, there is a one-to-one mapping from the EMA timescale to the weight decay hyperparameter. Thus, if the optimal EMA timescale is constant, then as the dataset size increases, the optimal weight decay should fall, and as the model size increases, the optimal weight decay should increase (if we follow the muP recommendation for scaling the learning rate). We validate these scaling rules on ResNet-18 and Vision Transformers trained on CIFAR-10 and ImageNet, and on NanoGPT pre-training on OpenWebText. Finally, we find that as training progresses, muP's learning rate scaling breaks down for AdamW unless weight decay is scaled appropriately.
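The timescale-to-weight-decay mapping can be sketched numerically. This is a minimal illustration, assuming the EMA view in which each AdamW step multiplies the weights by (1 − lr·wd), giving an EMA timescale of roughly 1/(lr·wd) iterations; the function name and argument values are illustrative, not from the paper.

```python
def weight_decay_for_timescale(tau_epochs, lr, dataset_size, batch_size):
    """Weight decay that fixes the EMA timescale at `tau_epochs` epochs.

    Assumes tau_iters = 1 / (lr * wd), so wd = 1 / (lr * tau_iters),
    where tau_iters is the timescale converted to iterations.
    """
    iters_per_epoch = dataset_size / batch_size
    tau_iters = tau_epochs * iters_per_epoch  # timescale in iterations
    return 1.0 / (lr * tau_iters)

# Holding the timescale (in epochs), learning rate, and batch size fixed,
# doubling the dataset size halves the implied optimal weight decay:
wd_small = weight_decay_for_timescale(tau_epochs=5, lr=1e-3,
                                      dataset_size=50_000, batch_size=256)
wd_large = weight_decay_for_timescale(tau_epochs=5, lr=1e-3,
                                      dataset_size=100_000, batch_size=256)
assert abs(2 * wd_large - wd_small) < 1e-9
```

This makes the abstract's scaling rules concrete: a fixed timescale in epochs means more iterations per epoch (larger dataset) requires a smaller weight decay, while a smaller learning rate (as muP prescribes for wider models) requires a larger one.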