Mixture-of-Experts (MoE) Large Language Models (LLMs) face a trilemma of load imbalance, parameter redundancy, and communication overhead. We introduce a unified framework based on dynamic expert clustering and structured compression to address these issues cohesively. Our method employs an online clustering procedure that periodically regroups experts using a fused metric of parameter and activation similarity, which stabilizes expert utilization. To our knowledge, this is among the first frameworks to leverage the semantic embedding capability of the router to dynamically reconfigure the model's architecture during training for substantial efficiency gains. Within each cluster, we decompose expert weights into a shared base matrix and extremely low-rank residual adapters, achieving up to a fivefold parameter reduction per group while preserving specialization. This structure enables a two-stage hierarchical routing strategy: tokens are first assigned to a cluster, then to specific experts within it, drastically reducing the routing search space and the volume of all-to-all communication. Furthermore, a heterogeneous precision scheme, which stores shared bases in FP16 and residual factors in INT4, coupled with dynamic offloading of inactive clusters, reduces peak memory consumption to levels comparable to dense models. Evaluated on GLUE and WikiText-103, our framework matches the quality of standard MoE models while reducing total parameters by approximately 80%, improving throughput by 10% to 20%, and lowering expert load variance by a factor of more than three. Our work demonstrates that structural reorganization is a principled path toward scalable, efficient, and memory-efficient MoE LLMs. Code is available at https://github.com/szdtzpj/Breaking_the_moe_trilemma.
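As an illustration of the shared-base-plus-low-rank-residual structure described above, the following NumPy sketch (all dimensions, the expert count, and the adapter rank are hypothetical choices, not values from the paper) shows how representing each expert in a cluster as one shared base matrix plus a rank-r residual adapter shrinks the per-group parameter count roughly fivefold:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 1024, 1024   # expert weight shape (illustrative)
n_experts = 8              # experts per cluster (illustrative)
rank = 32                  # residual adapter rank (illustrative)

# Dense baseline: every expert stores a full d_in x d_out matrix.
dense_params = n_experts * d_in * d_out

# Clustered form: one shared base per cluster plus per-expert
# low-rank residual factors A_i (d_in x r) and B_i (r x d_out).
base = rng.standard_normal((d_in, d_out))
adapters = [(rng.standard_normal((d_in, rank)),
             rng.standard_normal((rank, d_out)))
            for _ in range(n_experts)]

shared_params = d_in * d_out + n_experts * rank * (d_in + d_out)

def expert_forward(x, i):
    """Apply expert i: effective weight = shared base + A_i @ B_i."""
    A, B = adapters[i]
    return x @ base + (x @ A) @ B

x = rng.standard_normal((4, d_in))
y = expert_forward(x, 3)          # shape (4, d_out)

print(f"dense params:  {dense_params:,}")
print(f"shared params: {shared_params:,}")
print(f"reduction:     {dense_params / shared_params:.1f}x")
```

With these illustrative sizes the grouped representation holds about 5.3x fewer parameters than storing eight dense experts, while each expert still has its own low-rank residual and thus remains specialized.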