Large language models have transformed many applications but remain expensive to train. Sparse Mixture of Experts (MoE) addresses this through conditional computation, with Expert Parallel (EP) as the standard distributed training method. However, EP has three limitations: communication cost grows linearly with the number of activated experts $k$, load imbalance degrades latency and memory usage, and data-dependent communication requires metadata exchange. We propose Multi-Head LatentMoE and Head Parallel (HP), a new architecture and parallelism scheme that achieve $O(1)$ communication cost independent of $k$, fully balanced traffic, and deterministic communication, all while remaining compatible with EP. To accelerate Multi-Head LatentMoE, we introduce IO-aware routing and expert computation. Compared with MoE trained under EP, Multi-Head LatentMoE with HP trains up to $1.61\times$ faster at identical model quality. With doubled granularity, it achieves higher overall quality while remaining $1.11\times$ faster. Our method makes multi-billion-parameter foundation model research more accessible.