With the rapid growth in demand for computing power, cloud native networks have emerged as a promising way to coordinate resources efficiently, particularly when network bandwidth in a cluster fluctuates dynamically. We propose Metronome, a network-aware and priority-aware scheduling mechanism for cloud native networks. It is designed to support jobs with periodic traffic patterns and dynamic bandwidth demands, particularly distributed training jobs. Specifically, Metronome employs a time-division multiplexing approach that leverages job traffic characteristics to construct an elastic network resource allocation model, enabling efficient bandwidth sharing across multiple jobs. In addition, it incorporates a multi-objective optimization strategy that jointly considers latency and job priorities to achieve resource allocation that is both globally optimal and dynamic. Finally, Metronome adapts to the dynamic environment by monitoring the cluster and performing reconfiguration operations. Extensive experiments with 13 common machine learning models demonstrate that Metronome can enhance cluster resource utilization while guaranteeing service performance. Across multiple scenarios, compared with existing Kubernetes scheduling mechanisms, Metronome reduces job completion time by up to 19.50% while improving average bandwidth utilization by up to 23.20%.
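To make the time-division multiplexing idea concrete, the following is a minimal sketch, not Metronome's actual algorithm. It assumes that every job shares one common period divided into discrete slots, that each job sends a single contiguous burst per period, and that a larger `priority` value means a more important job. The names `Job`, `interleave`, `burst_slots`, and `demand_gbps` are hypothetical and introduced only for illustration. The sketch places jobs greedily in priority order, giving each job the start offset that keeps the peak load on its own slots lowest, so that communication bursts of different jobs interleave on a shared link instead of colliding.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int       # assumption: larger value = more important
    burst_slots: int    # slots per period spent communicating (1..period)
    demand_gbps: float  # bandwidth needed during the burst


def interleave(jobs, period_slots):
    """Greedily pick a start offset per job to keep the peak link load low.

    Returns (offsets, load): the chosen start slot for each job and the
    aggregate bandwidth demand in every slot of the shared period.
    """
    load = [0.0] * period_slots
    offsets = {}
    # Higher-priority jobs choose first, so they get the emptiest slots.
    for job in sorted(jobs, key=lambda j: -j.priority):
        best_offset, best_peak = 0, float("inf")
        for offset in range(period_slots):
            # The burst wraps around the end of the period.
            slots = [(offset + k) % period_slots for k in range(job.burst_slots)]
            peak = max(load[s] + job.demand_gbps for s in slots)
            if peak < best_peak:  # ties keep the earliest offset
                best_offset, best_peak = offset, peak
        for k in range(job.burst_slots):
            load[(best_offset + k) % period_slots] += job.demand_gbps
        offsets[job.name] = best_offset
    return offsets, load


# Hypothetical workload: three training jobs sharing a 10-slot period.
jobs = [
    Job("A", priority=2, burst_slots=4, demand_gbps=10.0),
    Job("B", priority=1, burst_slots=4, demand_gbps=10.0),
    Job("C", priority=0, burst_slots=2, demand_gbps=5.0),
]
offsets, load = interleave(jobs, period_slots=10)
print(offsets)    # start slot chosen for each job
print(max(load))  # peak aggregate demand on the link
```

In this toy workload, starting all three bursts in the same slot would produce a peak demand of 25 Gbps, whereas the interleaved placement keeps the peak at 10 Gbps. A real scheduler would also need to handle jobs with different periods, latency objectives, and reconfiguration when traffic patterns drift, none of which this sketch attempts.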