Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, but its large total parameter footprint creates significant storage and memory-access bottlenecks. These bottlenecks hinder efficient end-side (on-device) deployment, which simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO uses differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes its inputs before applying SiLU, producing a more stable trend in the routed-expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non-gated MLP experts with ReLU-based routing, which suggests that the MoE architecture can be simplified. Experiments demonstrate that DECO, activating only 20% of experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 3.00$\times$ speedup over dense inference on real hardware. Code and checkpoints will be released.
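To make the routing idea concrete, the following is a minimal NumPy sketch of a layer with ReLU-based routing, a learnable per-expert scale, one shared expert, and non-gated MLP experts. The abstract does not give the exact formulation, so several details here are assumptions made for illustration only: NormSiLU is taken to be RMS normalization over the hidden dimension followed by SiLU, it is used as the activation inside each expert, and the per-expert scale multiplies the ReLU router output. All function and parameter names (`norm_silu`, `relu_routed_moe`, `scale`, and so on) are hypothetical and are not taken from the DECO code.

```python
import numpy as np


def silu(x):
    # SiLU (swish): x * sigmoid(x).
    return x / (1.0 + np.exp(-x))


def norm_silu(x, eps=1e-6):
    # Assumption: "normalizes inputs prior to SiLU" is modeled as
    # RMS normalization over the last axis; the paper may differ.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return silu(x / rms)


def relu_routed_moe(x, w_router, scale, w_up, w_down, shared_up, shared_down):
    """Illustrative forward pass (hypothetical layout, not the authors' code).

    x:           (tokens, d_model)
    w_router:    (d_model, n_experts)
    scale:       (n_experts,) learnable expert-wise scaling
    w_up:        (n_experts, d_model, d_ff)   non-gated MLP, up projection
    w_down:      (n_experts, d_ff, d_model)   non-gated MLP, down projection
    shared_up:   (d_model, d_ff)              shared expert, always active
    shared_down: (d_ff, d_model)
    """
    # ReLU routing: an expert is active for a token only when its gate
    # is non-zero, so the number of active experts varies per token.
    gates = np.maximum(x @ w_router, 0.0) * scale

    # The shared expert processes every token.
    out = norm_silu(x @ shared_up) @ shared_down

    # Each routed expert runs only on the tokens that activate it.
    for e in range(w_up.shape[0]):
        active = gates[:, e] != 0
        if not active.any():
            continue
        hidden = norm_silu(x[active] @ w_up[e])
        out[active] += gates[active, e:e + 1] * (hidden @ w_down[e])
    return out, gates
```

Because the gate is a ReLU rather than a top-k selection, the fraction of non-zero entries in `gates` is the routed-expert activation ratio the abstract refers to, and it can be measured directly as `(gates != 0).mean()`. In this sketch the skipped experts contribute exactly zero, so the result equals a dense sum over all experts weighted by `gates`.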