The Mixture-of-Experts (MoE) technique plays a crucial role in scaling up DNN model parameters. However, it suffers from long all-to-all communication latency during training. Existing methods attempt to mitigate this issue by overlapping all-to-all with expert computation, yet they frequently fail to achieve sufficient overlap, limiting the potential performance gains. In our study, we broaden the scope of this problem by considering overlap at the level of the entire training graph. During the forward pass, we enable non-MoE computations to overlap with all-to-all through careful partitioning and pipelining. In the backward pass, we achieve overlap with all-to-all by scheduling weight-gradient computations. We implement these techniques in Lancet, a system that uses compiler-based optimization to automatically improve MoE model training. Our extensive evaluation shows that Lancet reduces non-overlapping communication time by as much as 77% and achieves an end-to-end speedup of up to 1.3x over state-of-the-art solutions.
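The forward-pass idea above can be illustrated with a minimal scheduling sketch. This is an assumption-laden toy model, not Lancet's actual implementation: it partitions a batch into chunks and builds a schedule in which the all-to-all of chunk i+1 runs concurrently with the computation of chunk i, so communication is hidden behind compute.

```python
# Toy sketch of partition-and-pipeline overlap (illustrative only; the real
# system dispatches ops on separate CUDA/communication streams).
def pipelined_schedule(num_chunks):
    """Return a list of per-timestep op sets.

    At each intermediate timestep, the all-to-all for one chunk overlaps
    the expert computation for the previous chunk.
    """
    steps = []
    for t in range(num_chunks + 1):
        ops = set()
        if t < num_chunks:
            ops.add(("all_to_all", t))          # communication for chunk t
        if t >= 1:
            ops.add(("expert_compute", t - 1))  # compute for previous chunk
        steps.append(ops)
    return steps

schedule = pipelined_schedule(3)
# Every timestep except the first and last runs one communication op
# and one computation op concurrently.
```

With 3 chunks the schedule has 4 timesteps; a non-pipelined version would need 6 serial steps, which is the latency-hiding benefit the abstract describes.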