Mixture-of-Experts (MoE) models have demonstrated exceptional performance in large-scale language models. Existing routers typically rely on non-differentiable Top-$k$+Softmax routing, which limits their performance and scalability. We argue that two distinct decisions, which experts to activate and how to distribute contributions among them, are conflated in standard Top-$k$+Softmax. We introduce Dirichlet-Routed MoE (DirMoE), a novel end-to-end differentiable routing mechanism built on a Dirichlet variational autoencoder framework. This design fundamentally disentangles the two core routing problems: expert selection, modeled by a Bernoulli component, and contribution allocation among the chosen experts, handled by a Dirichlet component. The entire forward pass remains fully differentiable through a Gumbel-Sigmoid relaxation for expert selection and implicit reparameterization for the Dirichlet distribution. Our training objective, a variational ELBO, includes a direct sparsity penalty that precisely controls the expected number of active experts, alongside a schedule for key hyperparameters that guides the model from an exploratory to a definitive routing state. Empirically, our DirMoE router matches or exceeds the performance of existing routing methods while improving expert specialization.
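A minimal sketch of the two disentangled decisions described above, in NumPy. The function names (`gumbel_sigmoid_gate`, `route`) are illustrative, not from the paper, and implicit-reparameterization gradients are omitted since NumPy has no autograd; this only shows the forward-pass structure of relaxed Bernoulli selection combined with Dirichlet contribution weights.

```python
import numpy as np

def gumbel_sigmoid_gate(logits, tau=1.0, rng=None):
    """Relaxed Bernoulli expert-selection gates in (0, 1).

    Adds Logistic(0, 1) noise to the selection logits and squashes with a
    temperature-scaled sigmoid; as tau -> 0 the gates approach hard 0/1.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-6, 1 - 1e-6, size=logits.shape)
    noise = np.log(u) - np.log1p(-u)  # inverse-CDF Logistic sample
    return 1.0 / (1.0 + np.exp(-(logits + noise) / tau))

def route(select_logits, alpha, tau=0.5, rng=None):
    """Combine selection gates with Dirichlet contribution weights.

    select_logits: (num_experts,) logits for *which* experts to activate.
    alpha:         (num_experts,) Dirichlet concentrations governing *how*
                   contribution is split among the activated experts.
    Returns per-expert mixture weights that sum to 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    gates = gumbel_sigmoid_gate(select_logits, tau, rng)  # selection
    pi = rng.dirichlet(alpha)                             # allocation
    w = gates * pi
    return w / max(w.sum(), 1e-12)  # renormalize over active experts

rng = np.random.default_rng(0)
weights = route(np.array([2.0, -1.0, 0.5, -2.0]),
                np.array([1.0, 1.0, 1.0, 1.0]), rng=rng)
```

In a differentiable implementation, `rng.dirichlet` would be replaced by a reparameterized Dirichlet sampler (e.g. via implicit reparameterization of the underlying Gamma variables) so gradients flow to `alpha`.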