We study scaling laws of signSGD under a power-law random features (PLRF) model that accounts for both feature and target decay. We analyze the population risk of a linear model trained with one-pass signSGD on Gaussian-sketched features, expressing the risk as a function of model size, training steps, learning rate, and the feature and target decay parameters. Comparing against the SGD risk analyzed by Paquette et al. (2024), we identify a drift-normalization effect and a noise-reshaping effect unique to signSGD. We then obtain compute-optimal scaling laws under the optimal choice of learning rate. Our analysis shows that the noise-reshaping effect can make the compute-optimal slope of signSGD steeper than that of SGD in noise-dominated regimes. Finally, we observe that the widely used warmup-stable-decay (WSD) schedule further reduces the noise term and sharpens the compute-optimal slope when feature decay is fast but target decay is slow.
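To make the setup concrete, the following is a minimal simulation sketch of one-pass signSGD on a PLRF instance, assuming a diagonal power-law covariance with eigenvalues $j^{-2\alpha}$, power-law target coefficients $j^{-\beta}$, and a Gaussian sketch; all dimensions, exponents, and learning-rate values are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Sketch of one-pass signSGD on a power-law random features (PLRF) model.
# Parameter values below (v, d, alpha, beta, eta, T) are illustrative only.

rng = np.random.default_rng(0)

v, d = 2000, 200          # ambient dimension, model size
alpha, beta = 1.0, 0.5    # feature decay, target decay exponents
eta, T = 0.01, 5000       # learning rate, number of one-pass steps

j = np.arange(1, v + 1)
sqrt_cov = j ** (-alpha)                    # feature spectrum: eigenvalues j^{-2*alpha}
b = j ** (-beta)                            # power-law target coefficients
W = rng.normal(size=(d, v)) / np.sqrt(d)    # Gaussian sketch of the features

theta = np.zeros(d)
for t in range(T):
    x = sqrt_cov * rng.normal(size=v)       # fresh sample each step (one pass)
    resid = theta @ (W @ x) - b @ x         # scalar prediction residual
    grad = resid * (W @ x)                  # stochastic gradient of the squared loss
    theta -= eta * np.sign(grad)            # signSGD: step by the gradient's sign

# Population risk E[(theta^T W x - b^T x)^2] has a closed form here,
# since x has diagonal covariance with entries sqrt_cov**2:
err = W.T @ theta - b
risk = np.sum((sqrt_cov * err) ** 2)
```

Replacing `np.sign(grad)` with `grad` recovers the plain one-pass SGD baseline, which is the comparison point for the drift-normalization and noise-reshaping effects.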
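For the learning-rate schedule, a minimal sketch of one common warmup-stable-decay (WSD) parameterization follows: linear warmup, constant plateau, then linear decay. The function name `wsd_lr` and the warmup/decay fractions are illustrative assumptions, not the exact schedule analyzed in the paper.

```python
def wsd_lr(t, T, eta_max, warmup_frac=0.05, decay_frac=0.2):
    """Warmup-stable-decay (WSD) schedule: linear warmup to eta_max,
    constant plateau, then linear decay to zero. Fractions are illustrative."""
    t_w = int(warmup_frac * T)          # end of warmup phase
    t_d = int((1 - decay_frac) * T)     # start of decay phase
    if t < t_w:
        return eta_max * t / max(t_w, 1)
    if t < t_d:
        return eta_max
    return eta_max * (T - t) / max(T - t_d, 1)
```

In the simulation sketch above, this would replace the constant `eta` with `wsd_lr(t, T, eta)` inside the training loop.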