Spectral Condition for $μ$P under Width-Depth Scaling

Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($μ$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $μ$P under joint width-depth scaling. For deep residual networks whose residual blocks contain $k$ transformations, the framework specifies how the norms of weights and their per-step updates should scale with width and depth. It reveals a fundamental transition from $k=1$ to $k\geq 2$, unifying previously disparate $μ$P formulations and identifying the $k\geq 2$ case as more appropriate for practical architectures with multi-transformation branches such as Transformers. Building on this framework, we derive a general recipe for implementing $μ$P across a broad class of optimizers by mapping spectral constraints to concrete HP parameterizations, recovering existing results and extending them to additional optimizers. Finally, experiments on GPT-2 style language models show that the $μ$P formulation derived from the $k\geq 2$ case achieves stable feature learning and robust HP transfer under width-depth scaling, whereas standard parameterization and $μ$P in the $k=1$ case often fail to do so. These results support the practical effectiveness of the proposed spectral framework.
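To make the idea of a spectral constraint concrete, the following minimal sketch illustrates the width-only spectral condition known from prior work on $μ$P, in which a weight matrix with `fan_in` inputs and `fan_out` outputs is kept at spectral norm on the order of $\sqrt{\text{fan\_out}/\text{fan\_in}}$. The depth-dependent factors and the $k$-dependent transition described above are not stated in this abstract, so they are deliberately left out. The helper names (`spectral_norm`, `target_spectral_norm`, `rescale_to_target`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def spectral_norm(w: np.ndarray) -> float:
    """Largest singular value of a 2-D weight matrix."""
    return float(np.linalg.norm(w, ord=2))


def target_spectral_norm(fan_in: int, fan_out: int) -> float:
    """Width-only target sqrt(fan_out / fan_in).

    Assumption: this is the width-scaling condition from prior work;
    any depth-dependent factor from the paper is NOT included here.
    """
    return float(np.sqrt(fan_out / fan_in))


def rescale_to_target(w: np.ndarray) -> np.ndarray:
    """Rescale w (shape: fan_out x fan_in) so its spectral norm hits the target."""
    fan_out, fan_in = w.shape
    return w * (target_spectral_norm(fan_in, fan_out) / spectral_norm(w))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for fan_in, fan_out in [(64, 64), (256, 1024), (1024, 256)]:
        w = rescale_to_target(rng.standard_normal((fan_out, fan_in)))
        # After rescaling, the measured norm matches the width-only target.
        print(fan_in, fan_out, spectral_norm(w), target_spectral_norm(fan_in, fan_out))
```

The same check could in principle be applied to a per-step update matrix instead of the weights themselves; how the targets change with depth and with the number of transformations per residual block is the subject of the paper and is not reproduced here.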
