Rethinking Language Model Scaling under Transferable Hypersphere Optimization

Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not structurally prevent training instability at scale. Recent hypersphere optimization methods constrain weight matrices to a fixed-norm hypersphere, offering a promising alternative for more stable scaling. We introduce HyperP (Hypersphere Parameterization), the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity under the Frobenius-sphere constraint with the Muon optimizer. We prove that weight decay is a first-order no-op on the Frobenius sphere, show that Depth-$\mu$P remains necessary, and find that the optimal learning rate follows the same data-scaling power law with the "magic exponent" 0.32 previously observed for AdamW. A single base learning rate tuned at the smallest scale transfers across all compute budgets under HyperP, yielding $1.58\times$ compute efficiency over a strong Muon baseline at $6\times10^{21}$ FLOPs. Moreover, HyperP delivers transferable stability: all monitored instability indicators, including $Z$-values, output RMS, and activation outliers, remain bounded and non-increasing as training FLOPs scale. We also propose SqrtGate, an MoE gating mechanism derived from the hypersphere constraint that preserves output RMS across MoE granularities for improved granularity scaling, and show that hypersphere optimization enables substantially larger auxiliary load-balancing weights, yielding both strong performance and good expert balance. We release our training codebase at https://github.com/microsoft/ArchScale.
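The claim that weight decay is a first-order no-op on the Frobenius sphere can be checked numerically with a minimal sketch. This is an illustration under simplifying assumptions, not the paper's proof or the ArchScale implementation: it uses a plain gradient step in place of Muon, decoupled weight decay, a random stand-in gradient, and a sphere of radius 1. The helper names `project`, `sphere_step`, and `decay_gap` are hypothetical. Because the decay term is radial, projecting back onto the sphere absorbs it into a rescaled step size, so the gap between decayed and undecayed updates should shrink quadratically in the learning rate.

```python
import numpy as np

def project(w, radius):
    """Rescale w so that its Frobenius norm equals radius."""
    return w * (radius / np.linalg.norm(w))

def sphere_step(w, grad, lr, wd, radius):
    """Plain gradient step with decoupled weight decay, then projection.

    Assumption: a raw gradient step stands in for the Muon update.
    """
    return project(w - lr * (grad + wd * w), radius)

rng = np.random.default_rng(0)
radius = 1.0  # assumed sphere radius, chosen for illustration
w = project(rng.standard_normal((8, 8)), radius)
grad = rng.standard_normal((8, 8))  # stand-in gradient
wd = 0.1

def decay_gap(lr):
    """Distance between the updates with and without weight decay."""
    with_wd = sphere_step(w, grad, lr, wd, radius)
    without_wd = sphere_step(w, grad, lr, 0.0, radius)
    return np.linalg.norm(with_wd - without_wd)

# If decay only enters at second order, a 10x smaller lr
# should shrink the gap by roughly 100x.
gaps = {lr: decay_gap(lr) for lr in (1e-2, 1e-3, 1e-4)}
for lr, gap in gaps.items():
    print(f"lr={lr:g}  gap={gap:.3e}")

# Weight decay is equivalent to a rescaled learning rate after projection.
lr = 1e-2
rescaled = sphere_step(w, grad, lr / (1.0 - lr * wd), 0.0, radius)
print(np.allclose(sphere_step(w, grad, lr, wd, radius), rescaled))
```

In this toy setting the decayed update coincides with an undecayed update at learning rate `lr / (1 - lr * wd)`, which differs from `lr` only at order `lr**2 * wd`. Whether the same bookkeeping carries over unchanged to Muon's orthogonalized updates is established in the paper, not by this sketch.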
