Mixture of Experts (MoE) models have emerged as a primary solution for reducing the computational cost of Large Language Models. In this work, we analyze their scaling properties, incorporating an expanded range of variables. Specifically, we introduce a new hyperparameter, granularity, whose adjustment enables precise control over the size of the experts. Building on this, we establish scaling laws for fine-grained MoE, taking into account the number of training tokens, model size, and granularity. Leveraging these laws, we derive the optimal training configuration for a given computational budget. Our findings not only show that MoE models consistently outperform dense Transformers but also highlight that the efficiency gap between dense and MoE models widens as we scale up the model size and training budget. Furthermore, we demonstrate that the common practice of setting the size of experts in MoE to mirror the feed-forward layer is not optimal at almost any computational budget.
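To make the role of granularity concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not define granularity; we assume here that it is the ratio of the dense feed-forward hidden size to the expert hidden size, and that the number of experts and the number of experts routed per token both scale with it, so that total and active parameter counts stay fixed. The names `MoELayerConfig`, `fine_grained_config`, `expansion_rate`, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoELayerConfig:
    """Hypothetical description of one MoE feed-forward layer."""
    d_model: int
    d_expert: int            # hidden size of a single expert
    num_experts: int
    experts_per_token: int   # how many experts each token is routed to

    @property
    def total_params(self) -> int:
        # Each expert is modeled as two matrices (d_model x d_expert and
        # d_expert x d_model); biases and the router are ignored.
        return self.num_experts * 2 * self.d_model * self.d_expert

    @property
    def active_params(self) -> int:
        # Parameters actually used for one token.
        return self.experts_per_token * 2 * self.d_model * self.d_expert


def fine_grained_config(d_model: int, d_ff: int,
                        expansion_rate: int, granularity: int) -> MoELayerConfig:
    """Build a layer config for a given granularity.

    ASSUMPTION (not stated in the abstract): granularity = d_ff / d_expert,
    with expert count and experts-per-token both multiplied by granularity.
    granularity == 1 corresponds to experts that mirror the dense
    feed-forward layer.
    """
    if granularity < 1 or d_ff % granularity != 0:
        raise ValueError("granularity must be a positive divisor of d_ff")
    return MoELayerConfig(
        d_model=d_model,
        d_expert=d_ff // granularity,
        num_experts=expansion_rate * granularity,
        experts_per_token=granularity,
    )


# Illustrative values only: smaller experts, same parameter budget.
for g in (1, 2, 4, 8):
    cfg = fine_grained_config(d_model=512, d_ff=2048, expansion_rate=8, granularity=g)
    print(g, cfg.d_expert, cfg.num_experts, cfg.total_params, cfg.active_params)
```

Under these assumptions, raising the granularity shrinks each expert and increases the number of experts while holding both parameter counts constant, which is what lets expert size be varied independently of model size.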