We show that communication schedulers recently proposed for ML collectives do not scale to the increasing problem sizes that arise from training larger models. These schedulers also often produce suboptimal schedules. We make a connection with similar problems in traffic engineering and propose a new method, TECCL, that finds better-quality schedules (e.g., ones that finish collectives faster and/or send fewer bytes) and does so more quickly on larger topologies. We present results on many different GPU topologies that show substantial improvement over the state of the art.