The training of large-scale Mixture of Experts (MoE) models faces a critical memory bottleneck: dynamic token routing causes severe load imbalance across experts, which can overflow the memory of GPUs with limited capacity and constrains model scalability. Existing load-balancing methods, which cap expert capacity, compromise model accuracy and can still exhaust memory on constrained hardware. To address this, we propose MemFine, a memory-aware fine-grained scheduling framework for MoE training. MemFine decomposes token dispatch and expert computation into manageable chunks and employs a chunked recomputation strategy, dynamically optimized through a theoretical memory model to balance memory efficiency and throughput. Experiments demonstrate that MemFine reduces activation memory by 48.03% and improves throughput by 4.42% compared to full recomputation-based baselines, enabling stable large-scale MoE training on memory-limited GPUs.
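The trade-off the abstract describes can be illustrated with a minimal sketch: processing each expert's (possibly imbalanced) token batch in fixed-size chunks and discarding intermediate activations after each chunk, so that peak activation memory is bounded by the chunk size rather than by the heaviest-loaded expert. All names below (`expert_fn`, `forward_full`, `forward_chunked`) are illustrative stand-ins, not MemFine's actual API; real systems would recompute the discarded activations during the backward pass.

```python
# Illustrative sketch of chunked expert computation; not MemFine's API.

def expert_fn(tokens):
    """Stand-in for an expert's forward pass: one 'activation' per token."""
    return [t * 2 for t in tokens]

def forward_full(tokens_per_expert):
    """Baseline: compute every expert at once and keep all activations live,
    so peak memory grows with the total (imbalanced) token load."""
    activations = [expert_fn(toks) for toks in tokens_per_expert]
    peak = sum(len(a) for a in activations)  # all activations resident
    return activations, peak

def forward_chunked(tokens_per_expert, chunk_size):
    """Chunked variant: split each expert's batch into chunks and retain only
    the outputs; intermediates would be recomputed in the backward pass."""
    outputs, peak = [], 0
    for toks in tokens_per_expert:
        out = []
        for i in range(0, len(toks), chunk_size):
            act = expert_fn(toks[i:i + chunk_size])  # <= chunk_size live
            peak = max(peak, len(act))
            out.extend(act)                          # only outputs retained
        outputs.append(out)
    return outputs, peak

# Imbalanced routing: expert 0 receives 16 tokens, expert 1 only 4.
loads = [list(range(16)), list(range(4))]
full_out, full_peak = forward_full(loads)
chunk_out, chunk_peak = forward_chunked(loads, chunk_size=4)
assert chunk_out == full_out     # same results,
assert chunk_peak < full_peak    # lower peak activation memory
```

In this toy setting the peak drops from 20 resident activations to 4; the chunk size plays the role of the knob that MemFine's memory model tunes to trade a smaller memory footprint against recomputation overhead.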