As demand for computing power grows rapidly, cloud native networks have emerged as a promising way to coordinate resources efficiently, particularly when network bandwidth in clusters fluctuates dynamically. We propose Metronome, a network-aware and priority-aware scheduling mechanism for cloud native networks. Metronome is designed for jobs with periodic traffic patterns and dynamic bandwidth demands, particularly distributed training jobs. Specifically, it uses time-division multiplexing: it leverages job traffic characteristics to construct an elastic network resource allocation model that enables efficient bandwidth sharing across multiple jobs. It also incorporates a multi-objective optimization strategy that jointly considers latency and job priorities to achieve resource allocation that is both globally optimal and dynamic. Finally, Metronome adapts to the dynamic environment by monitoring the cluster and performing reconfiguration operations. Extensive experiments with 13 common machine learning models demonstrate that Metronome can enhance cluster resource utilization while guaranteeing service performance. Compared with existing Kubernetes scheduling mechanisms across multiple scenarios, Metronome reduces job completion time by up to 19.50% while improving average bandwidth utilization by up to 23.20%.
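The time-division multiplexing idea can be illustrated with a toy sketch. It is not Metronome's actual algorithm, which the abstract does not specify. It assumes each job sends a fixed-size burst at the start of every training iteration, and it greedily picks a start offset per job, in priority order, so that bursts interleave on a shared link instead of colliding. The `Job` fields, the slot-based time model, and the `schedule` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from functools import reduce
from math import gcd


@dataclass(frozen=True)
class Job:
    """Hypothetical model of a job with periodic traffic (not from the paper)."""
    name: str
    period: int    # time slots per training iteration
    burst: int     # slots at the start of each iteration spent communicating
    demand: float  # bandwidth needed during a burst (e.g. Gbps)
    priority: int  # larger value = scheduled first


def hyperperiod(jobs):
    """Least common multiple of all job periods: the traffic pattern repeats after this."""
    return reduce(lambda a, b: a * b // gcd(a, b), (j.period for j in jobs), 1)


def add_load(load, job, offset):
    """Add the job's bursts, shifted by `offset` slots, onto the link load profile."""
    horizon = len(load)
    for start in range(offset, horizon + offset, job.period):
        for k in range(job.burst):
            load[(start + k) % horizon] += job.demand


def schedule(jobs):
    """Greedy assumption: higher-priority jobs choose first; each job takes the
    smallest offset that minimizes the peak aggregate bandwidth demand."""
    load = [0.0] * hyperperiod(jobs)
    offsets = {}
    for job in sorted(jobs, key=lambda j: -j.priority):
        best_offset, best_peak = 0, float("inf")
        for offset in range(job.period):
            trial = load[:]  # evaluate the candidate on a copy
            add_load(trial, job, offset)
            peak = max(trial)
            if peak < best_peak:
                best_offset, best_peak = offset, peak
        add_load(load, job, best_offset)
        offsets[job.name] = best_offset
    return offsets, max(load)


jobs = [
    Job("A", period=4, burst=2, demand=10.0, priority=2),
    Job("B", period=4, burst=2, demand=10.0, priority=1),
]
offsets, peak = schedule(jobs)
print(offsets, peak)
```

In this example, starting both jobs at slot 0 would require 20 units of bandwidth during the overlapping bursts. Shifting the lower-priority job by two slots keeps the peak at 10, so both jobs fit on a link sized for one burst. A real scheduler would also have to account for latency, measurement noise, and reconfiguration as traffic patterns drift, which this sketch ignores.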