The quality of Large Language Model (LLM) pretraining depends on multiple factors, including the compute budget and the choice of optimization algorithm. Empirical scaling laws are widely used to predict loss as model size and training data grow; however, almost all existing studies fix the optimizer (typically AdamW). At the same time, a new generation of optimizers (e.g., Muon, Shampoo, SOAP) promises faster and more stable convergence, but their relationship with model and data scaling is not yet well understood. In this work, we study scaling laws across different optimizers. Empirically, we show that 1) fitting a separate Chinchilla-style scaling law to each optimizer is ill-conditioned and yields highly correlated parameters. Instead, 2) we propose a more robust law with shared power-law exponents and optimizer-specific rescaling factors, which enables direct comparison between optimizers. Finally, 3) we provide a theoretical analysis of gradient-based methods on the proxy task of a convex quadratic objective, demonstrating that Chinchilla-style scaling laws emerge naturally from the decomposition of the loss into irreducible, approximation, and optimization errors.
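To make the two functional forms concrete, the sketch below contrasts the classic Chinchilla parameterization with one plausible shared-exponent variant. All numeric values and the specific way the optimizer-specific rescaling factors (here `c_o`, `d_o`) enter the law are illustrative assumptions, not the paper's fitted parameterization.

```python
# Classic Chinchilla form: L(N, D) = E + A / N**alpha + B / D**beta,
# where N is parameter count and D is number of training tokens.
def chinchilla_loss(N, D, E, A, alpha, B, beta):
    return E + A / N**alpha + B / D**beta

# Hypothetical shared-exponent variant: the exponents (alpha, beta) are
# common to all optimizers; each optimizer o contributes only rescaling
# factors (c_o, d_o) applied to N and D. This is an illustrative guess
# at the parameterization, not the paper's exact form.
def shared_exponent_loss(N, D, E, A, alpha, B, beta, c_o, d_o):
    return E + A / (c_o * N) ** alpha + B / (d_o * D) ** beta

# Toy comparison: an optimizer with c_o = d_o = 2 behaves like the
# baseline optimizer trained with twice the parameters and twice the data.
N, D = 1e8, 1e10
base = shared_exponent_loss(N, D, E=1.7, A=400.0, alpha=0.34,
                            B=410.0, beta=0.28, c_o=1.0, d_o=1.0)
better = shared_exponent_loss(N, D, E=1.7, A=400.0, alpha=0.34,
                              B=410.0, beta=0.28, c_o=2.0, d_o=2.0)
equiv = chinchilla_loss(2 * N, 2 * D, E=1.7, A=400.0, alpha=0.34,
                        B=410.0, beta=0.28)
```

Because every optimizer shares the same exponents, the factors `c_o` and `d_o` alone summarize its effective advantage, which is what allows direct comparison across optimizers.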