Recent years have seen growing interest in and adoption of large language models (LLMs), with Mixture-of-Experts (MoE) emerging as a leading architecture for extremely large models. The largest open-source models currently exceed $1$T parameters, a scale at which hyperparameter tuning becomes prohibitively expensive. Precisely for this reason, $\mu$Transfer is becoming a key technique: it allows optimal hyperparameters to be transferred seamlessly across model scales, yielding a substantial reduction in tuning costs. However, existing work has focused primarily on dense LLMs, leaving MoE architectures unexplored. In this work, we derive a $\mu$-Parameterization for MoE, providing theoretical guarantees for feature learning across model widths. Our experiments demonstrate that the optimal learning rate reliably transfers across model sizes, establishing a foundation for efficient hyperparameter tuning in large-scale MoE models.
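As background, a minimal sketch of the standard $\mu$P scaling for a dense hidden weight matrix trained with Adam; the symbols $n_{\text{base}}$, $n$, and $\eta^{*}$ are illustrative, and the MoE-specific treatment (expert and router weights) is what this work derives. The initialization variance and the per-layer learning rate are scaled with the width $n$ so that the learning rate $\eta^{*}$ tuned at the base width $n_{\text{base}}$ remains optimal at larger widths:
\[
W_{ij} \sim \mathcal{N}\!\left(0, \tfrac{1}{n}\right), \qquad
\eta_{\text{hidden}}(n) = \eta^{*} \cdot \frac{n_{\text{base}}}{n}.
\]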