Metronome: Efficient Scheduling for Periodic Traffic Jobs with Network and Priority Awareness

With the rapid growth in demand for computing power, cloud native networks have emerged as a promising way to coordinate resources efficiently, particularly when network bandwidth in a cluster fluctuates dynamically. We propose Metronome, a network-aware and priority-aware scheduling mechanism for cloud native networks. It is designed to support jobs that exhibit periodic traffic patterns and dynamic bandwidth demands, particularly distributed training jobs. Specifically, Metronome employs a time-division multiplexing approach that leverages job traffic characteristics to construct an elastic network resource allocation model, enabling efficient bandwidth sharing across multiple jobs. In addition, it incorporates a multi-objective optimization strategy that jointly considers latency and job priorities to achieve resource allocation that is both globally optimal and dynamic. Finally, Metronome adapts to the dynamic environment by monitoring the cluster and performing reconfiguration operations. Extensive experiments with 13 common machine learning models demonstrate that Metronome can enhance cluster resource utilization while guaranteeing service performance. Compared with existing Kubernetes scheduling mechanisms across multiple scenarios, Metronome reduces job completion time by up to 19.50% while improving average bandwidth utilization by up to 23.20%.
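To make the time-division multiplexing idea concrete, the following is a minimal sketch, not Metronome's actual algorithm. It assumes each job sends traffic in a fixed-length burst once per period, and it brute-forces phase offsets so that bursts from different jobs overlap as little as possible. The job tuples, the function names `demand_timeline` and `best_offsets`, and the bandwidth figures are all hypothetical and chosen only for illustration.

```python
from itertools import product


def demand_timeline(jobs, offsets, horizon):
    """Return total bandwidth demand per time slot for the given phase offsets.

    Each job is a hypothetical (period, burst_length, bandwidth) tuple: the job
    transmits at `bandwidth` for `burst_length` slots at the start of every period.
    """
    timeline = [0.0] * horizon
    for (period, burst, bandwidth), offset in zip(jobs, offsets):
        for t in range(horizon):
            # The job is in its communication phase during the first
            # `burst` slots of each period, shifted by `offset`.
            if (t - offset) % period < burst:
                timeline[t] += bandwidth
    return timeline


def best_offsets(jobs, horizon):
    """Brute-force the offsets that minimize peak demand (illustration only)."""
    best = None
    for offsets in product(*(range(period) for period, _, _ in jobs)):
        peak = max(demand_timeline(jobs, offsets, horizon))
        if best is None or peak < best[0]:
            best = (peak, offsets)
    return best


# Two hypothetical jobs: (period, burst length, bandwidth in Gbps).
jobs = [(4, 2, 10.0), (4, 2, 10.0)]

aligned_peak = max(demand_timeline(jobs, (0, 0), horizon=8))
staggered_peak, offsets = best_offsets(jobs, horizon=8)
```

With both jobs starting at the same time, their bursts collide and the peak demand is the sum of both bandwidths; shifting the second job by half a period interleaves the bursts so the link never carries more than one job's traffic at once. An exhaustive search like this only scales to a handful of jobs, and it ignores the latency and priority objectives the abstract mentions.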

