Sparse Mixture-of-Experts (MoE) architectures enable scaling LLM parameters under a fixed inference budget by activating only a small subset of experts via top-$k$ routing. While this preserves causality and suits autoregressive language models, the discrete top-$k$ operator is not differentiable, forcing a fixed number of active experts per input and resulting in inefficient use of computation. We propose SoftMoE, which replaces discrete routing with a truncated soft top-$k$ LapSum relaxation, allowing gradient-based optimization of expert routing. We further parameterize the mean number of active experts per layer and impose a global budget constraint, enabling the model to learn how to allocate expert capacity across layers. SoftMoE remains fully compatible with autoregressive modeling and achieves performance comparable to or better than sparse MoE on language modeling and downstream tasks, while activating significantly fewer experts. Notably, the learned allocation is highly non-uniform, with later layers activating more experts. The source code is publicly available$^\dagger$.
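To make the contrast between discrete and relaxed routing concrete, the following is a minimal sketch and not the paper's implementation. It does not reproduce the LapSum relaxation: as a stand-in, it uses a logistic gate whose threshold is found by bisection so that the gates sum to roughly $k$. The function names and the `temperature` and `eps` parameters are assumptions made for illustration only.

```python
import math


def hard_top_k(scores, k):
    """Discrete top-k gate: 1.0 for the k largest scores, 0.0 elsewhere."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    gates = [0.0] * len(scores)
    for i in top:
        gates[i] = 1.0
    return gates


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_top_k(scores, k, temperature=0.1, eps=1e-3, iters=100):
    """Smooth stand-in for top-k (NOT LapSum): gates in [0, 1] summing to ~k.

    k may be fractional but must satisfy 0 < k < len(scores).
    Gates below eps are truncated to exactly 0 so those experts can be skipped.
    """
    if not 0 < k < len(scores):
        raise ValueError("k must lie strictly between 0 and the number of experts")

    def gate_sum(threshold):
        return sum(_sigmoid((s - threshold) / temperature) for s in scores)

    # gate_sum is decreasing in the threshold, so bisection applies.
    lo = min(scores) - 40.0 * temperature  # here gate_sum is ~len(scores) > k
    hi = max(scores) + 40.0 * temperature  # here gate_sum is ~0 < k
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gate_sum(mid) > k:
            lo = mid
        else:
            hi = mid
    threshold = 0.5 * (lo + hi)

    gates = [_sigmoid((s - threshold) / temperature) for s in scores]
    # Truncation: drop negligible gates so the expert is not evaluated at all.
    return [g if g >= eps else 0.0 for g in gates]


if __name__ == "__main__":
    router_scores = [2.0, 0.5, 1.5, -1.0]
    print(hard_top_k(router_scores, 2))
    print(soft_top_k(router_scores, 2))
    print(soft_top_k(router_scores, 1.5))  # a fractional budget is allowed
```

Because the soft gate accepts a fractional `k`, a per-layer budget could in principle be treated as a continuous quantity, which is the property the abstract relies on when it describes learning the mean number of active experts per layer. How SoftMoE actually parameterizes and constrains that budget is not specified here.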