The Normalized Transformer, or nGPT (arXiv:2410.01131) achieves impressive training speedups and does not require weight decay or learning rate warmup. However, despite having hyperparameters that explicitly scale with model size, we observe that nGPT does not exhibit learning rate transfer across model dimension and token horizon. To rectify this, we combine numerical experiments with a principled use of alignment exponents (arXiv:2407.05872) to revisit and modify the $μ$P approach to hyperparameter transfer (arXiv:2011.14522). The result is a novel nGPT parameterization we call $ν$GPT. Through extensive empirical validation, we find $ν$GPT exhibits learning rate transfer across width, depth, and token horizon.
翻译:归一化Transformer(简称nGPT,见arXiv:2410.01131)在训练加速方面表现优异,且无需权重衰减或学习率预热。然而,尽管其超参数明确随模型规模缩放,我们观察到nGPT并未在模型维度与词元视野上展现学习率迁移。为解决此问题,我们结合数值实验与对齐指数的原则性应用(arXiv:2407.05872),重新审视并修改了超参数迁移的$μ$P方法(arXiv:2011.14522)。由此诞生的新型nGPT参数化称为$ν$GPT。通过广泛的经验验证,我们发现$ν$GPT在宽度、深度和词元视野上均展现出学习率迁移特性。