The Mixture-of-Experts (MoE) architecture improves the efficiency of Large Language Models (LLMs) through modularized computation, yet its inherent sparsity poses significant hardware deployment challenges, including poor memory locality, communication overhead, and inefficient utilization of compute resources. Inspired by the modular organization of the human brain, we propose Mozart, a novel algorithm-hardware co-design framework tailored for efficient training of MoE-based LLMs on 3.5D wafer-scale chiplet architectures. On the algorithm side, Mozart exploits the inherent modularity of chiplets and introduces: (1) an expert allocation strategy that enables efficient on-package all-to-all communication, and (2) a fine-grained scheduling mechanism that improves communication-computation overlap by streaming tokens and experts. On the architecture side, Mozart adaptively co-locates heterogeneous modules on specialized chiplets with a 2.5D NoP-Tree topology and a hierarchical memory structure. Evaluations on three popular MoE models demonstrate significant efficiency gains, enabling more effective parallelization and resource utilization for large-scale modularized MoE-LLMs.
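To illustrate the second algorithmic idea, the sketch below shows how splitting a batch into token chunks lets all-to-all dispatch of the next chunk overlap with expert computation on the current one. This is a minimal, self-contained illustration under assumed names and stand-in latencies (dispatch_chunk, expert_compute, NUM_CHUNKS are hypothetical), not Mozart's actual scheduler or hardware model.

```python
# Illustrative sketch of communication-computation overlap via token chunking.
# Not the Mozart implementation; all names and timings are placeholders.
from concurrent.futures import ThreadPoolExecutor
import time

NUM_CHUNKS = 4          # hypothetical number of token micro-chunks per batch
COMM_LATENCY = 0.02     # stand-in for on-package all-to-all latency (seconds)
COMPUTE_LATENCY = 0.03  # stand-in for per-chunk expert FFN time (seconds)

def dispatch_chunk(chunk_id: int) -> int:
    """Stand-in for routing one token chunk to its experts (all-to-all)."""
    time.sleep(COMM_LATENCY)
    return chunk_id

def expert_compute(chunk_id: int) -> int:
    """Stand-in for running the expert FFN on an already-dispatched chunk."""
    time.sleep(COMPUTE_LATENCY)
    return chunk_id

def run_sequential() -> float:
    """Baseline: dispatch each chunk, then compute it (no overlap)."""
    start = time.perf_counter()
    for i in range(NUM_CHUNKS):
        expert_compute(dispatch_chunk(i))
    return time.perf_counter() - start

def run_pipelined() -> float:
    """Fine-grained schedule: overlap dispatch of chunk i+1 with compute of chunk i."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        comm_future = pool.submit(dispatch_chunk, 0)
        for i in range(NUM_CHUNKS):
            ready = comm_future.result()                          # wait for chunk i's tokens
            if i + 1 < NUM_CHUNKS:
                comm_future = pool.submit(dispatch_chunk, i + 1)  # prefetch next chunk
            expert_compute(ready)                                 # compute overlaps next dispatch
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"sequential: {run_sequential():.3f}s  pipelined: {run_pipelined():.3f}s")
```

With these stand-in latencies, the pipelined schedule hides most of the dispatch time behind expert computation; the same chunking idea underlies the streaming of tokens and experts described above.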