We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters. CASSINI introduces a novel geometric abstraction that accounts for the communication patterns of different jobs when placing them on network links. To do so, CASSINI uses an affinity graph to find a series of time-shift values that adjust the communication phases of a subset of jobs, so that the communication patterns of jobs sharing the same network link are interleaved with each other. Experiments with 13 common ML models on a 24-server testbed demonstrate that, compared to state-of-the-art ML schedulers, CASSINI improves the average and tail completion time of jobs by up to 1.6x and 2.5x, respectively. Moreover, we show that CASSINI reduces the number of ECN-marked packets in the cluster by up to 33x.
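To make the idea of interleaving communication phases concrete, the following is a minimal sketch and not CASSINI's actual algorithm. It assumes two periodic jobs sharing one link, each modeled as a fixed-length iteration with a single communication burst, on a discretized timeline. The names `is_communicating`, `overlap`, and `best_shift`, as well as the job parameters, are hypothetical and chosen only for illustration. A brute-force search stands in for the geometric abstraction and the affinity graph.

```python
from math import lcm  # available in Python 3.9+


def is_communicating(job, shift, t):
    """Return True if the job uses the link at time slot t, given its time shift."""
    phase = (t - shift) % job["period"]
    return job["comm_start"] <= phase < job["comm_start"] + job["comm_len"]


def overlap(jobs, shifts):
    """Count contended slots over one hyper-period (slots with more than one sender)."""
    horizon = lcm(*(job["period"] for job in jobs))
    contended = 0
    for t in range(horizon):
        active = sum(is_communicating(job, s, t) for job, s in zip(jobs, shifts))
        contended += max(0, active - 1)
    return contended


def best_shift(job_a, job_b):
    """Keep job_a fixed and brute-force the shift of job_b that minimizes overlap.

    Returns (shift, overlap); ties are broken in favor of the smallest shift.
    """
    candidates = ((s, overlap([job_a, job_b], [0, s])) for s in range(job_b["period"]))
    return min(candidates, key=lambda pair: pair[1])


# Hypothetical jobs: 10-slot iterations, 6 slots of compute then 4 of communication.
job_a = {"period": 10, "comm_start": 6, "comm_len": 4}
job_b = {"period": 10, "comm_start": 6, "comm_len": 4}

unshifted = overlap([job_a, job_b], [0, 0])  # both bursts collide: 4 contended slots
shift, remaining = best_shift(job_a, job_b)  # shift of 4 slots leaves 0 contended slots
```

In this toy setup, delaying the second job by 4 slots moves its burst into the first job's compute phase, so the two jobs never send at the same time. A real scheduler would also have to handle many jobs, many links, and unequal iteration times, which this sketch does not attempt.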