Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($\mu$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $\mu$P under joint width-depth scaling. Considering residual networks of varying block depths, we first introduce a spectral $\mu$P condition that precisely characterizes how the norms of weights and their per-step updates should scale with width and depth, unifying previously disparate $\mu$P formulations as special cases. Building on this condition, we then derive a general recipe for implementing $\mu$P across a broad class of optimizers by mapping the spectral constraints to concrete HP parameterizations. This approach not only recovers existing $\mu$P formulations (e.g., for SGD and AdamW) but also naturally extends to a wider range of optimizers. Finally, experiments on GPT-2-style language models demonstrate that the proposed spectral $\mu$P condition preserves stable feature learning and enables robust HP transfer under width-depth scaling.
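As a concrete illustration of how a spectral constraint can be mapped to an HP parameterization, the sketch below builds per-layer AdamW learning-rate groups that shrink with fan-in (the standard width-$\mu$P rule for Adam-type optimizers) and applies a hypothetical $(\text{base depth}/\text{depth})^{1/2}$ correction to residual-branch weights. This is a minimal sketch under our own assumptions: the function name mup_adamw_param_groups, the residual-name heuristic, the uniform treatment of all weight matrices (full $\mu$P handles input and output layers separately), and the depth exponent are illustrative, not the recipe derived in this work.

```python
# Hedged sketch: width-scaled per-layer AdamW learning rates in the spirit of
# muP hyperparameter transfer from a small proxy model. The sqrt(depth) factor
# on residual branches is an illustrative assumption, not the paper's rule.
import torch
import torch.nn as nn

def mup_adamw_param_groups(model, base_lr, base_width, base_depth, depth,
                           residual_substring="resid"):
    """Build AdamW param groups whose lr is rescaled from a small proxy model.

    Illustrative assumptions:
      * weight-matrix lr scales as base_width / fan_in (width-muP rule for Adam);
      * weights whose name contains `residual_substring` get an extra
        (base_depth / depth) ** 0.5 factor as a hypothetical depth correction.
    """
    groups = []
    for name, p in model.named_parameters():
        if p.ndim == 2:                      # weight matrices: shape (fan_out, fan_in)
            fan_in = p.shape[1]
            lr = base_lr * base_width / fan_in
        else:                                # biases / norm params: keep the base lr
            lr = base_lr
        if residual_substring in name:       # hypothetical residual-branch depth factor
            lr *= (base_depth / depth) ** 0.5
        groups.append({"params": [p], "lr": lr})
    return groups

# Usage on a toy stack (stand-in for transformer blocks): the base lr was tuned
# on a width-256, depth-4 proxy and is transferred to a width-1024, depth-16 model.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
opt = torch.optim.AdamW(mup_adamw_param_groups(
    model, base_lr=1e-3, base_width=256, base_depth=4, depth=16))
```

The design choice mirrored here is that HP transfer is implemented purely through per-parameter learning-rate (and, in general, initialization and multiplier) rules, so the same tuned base values can be reused as width and depth grow.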