Overlapping communication with computation is crucial for distributed large-model training, yet optimizing this overlap, especially when computation becomes the bottleneck, remains challenging. We present Lagom, a system that co-tunes communication parameters to balance resource usage between computation and communication. By introducing a unified cost model and a priority-based search algorithm, Lagom reduces optimization complexity from exponential to linear. Evaluations on high- and low-bandwidth GPU clusters show that Lagom achieves 1.07-1.33x and 1.03-1.27x speedups over NCCL and AutoCCL, respectively, across diverse models and parallelization strategies.
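As a rough illustration of how a priority-based, one-parameter-at-a-time search can replace an exponential joint search, the sketch below greedily tunes a small hypothetical space of communication parameters. Everything here (the parameter names, candidate values, and the cost callable) is an illustrative assumption, not Lagom's actual interface, cost model, or priority definition.

```python
"""Minimal sketch of a priority-based parameter search: fix a default
configuration, rank parameters by estimated impact, then tune each one
independently. Cost-model evaluations grow linearly with the total number
of candidate values (a sum over parameters) rather than exponentially
(a product over parameters)."""
from typing import Callable, Dict, List

# Hypothetical tuning space: each communication parameter and its candidates.
PARAM_SPACE: Dict[str, List[int]] = {
    "num_channels":  [1, 2, 4, 8],
    "chunk_size_kb": [64, 128, 256, 512],
    "thread_blocks": [2, 4, 8, 16],
}

def priority_search(cost: Callable[[Dict[str, int]], float],
                    space: Dict[str, List[int]]) -> Dict[str, int]:
    """Greedy coordinate search over `space`, minimizing `cost`."""
    # Start from a default configuration (first candidate of each parameter).
    config = {p: vals[0] for p, vals in space.items()}

    # Rank parameters by the cost spread a one-dimensional sweep exposes;
    # high-impact parameters are tuned first. This stands in for the
    # cost-model-derived priorities described in the abstract.
    def spread(p: str) -> float:
        costs = [cost({**config, p: v}) for v in space[p]]
        return max(costs) - min(costs)

    for p in sorted(space, key=spread, reverse=True):
        # One linear sweep per parameter, holding the others fixed.
        config[p] = min(space[p], key=lambda v: cost({**config, p: v}))
    return config

# Toy usage with a synthetic cost function (lower is better):
best = priority_search(
    lambda c: abs(c["num_channels"] - 4)
              + abs(c["chunk_size_kb"] - 256) / 64
              + abs(c["thread_blocks"] - 8) / 2,
    PARAM_SPACE,
)
print(best)  # expected: {'num_channels': 4, 'chunk_size_kb': 256, 'thread_blocks': 8}
```

For the three four-way parameters above, a joint search would evaluate 4^3 = 64 configurations, while the greedy pass needs at most two sweeps of 4 + 4 + 4 evaluations; the gap widens rapidly as parameters are added.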