In this paper, we design, implement, and evaluate Polyphony, a system that gives network operators a new way to control and reduce the frequency of poor tail-latency events in multi-class data center networks on the time scale of minutes. Polyphony is designed to complement other adaptive mechanisms such as congestion control and traffic engineering, but it targets aspects of network operation that have previously been considered static. In contrast to Polyphony, prior model-free optimization methods work best when there are only a few relevant degrees of freedom and when workloads and measurements are stable, assumptions that do not hold in modern data center networks. Polyphony develops novel methods for measuring, predicting, and controlling network quality-of-service metrics under a dynamically changing workload. First, we monitor and aggregate workloads on a network-wide basis. Second, we use the result as input to an approximate counterfactual prediction engine that estimates the effect of potential network configuration changes on network quality of service. Third, we apply the best candidate and repeat in a closed loop, aiming to converge rapidly and stably to a configuration that meets operator goals. Using CloudLab on a simple topology, we observe that Polyphony converges to tight SLOs within ten minutes and re-stabilizes after large workload shifts within fifteen minutes, while the prior state of the art fails to adapt.
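The measure, predict, and apply loop described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not Polyphony's implementation: the names `aggregate`, `toy_predict`, `neighbors`, `control_step`, and `run` are hypothetical, the configuration is modeled as per-class scheduling weights, and the predictor is a toy load-to-weight ratio standing in for the approximate counterfactual prediction engine.

```python
from typing import Callable, Dict, List, Sequence

Workload = Dict[str, float]  # traffic class -> offered load (hypothetical units)
Config = Dict[str, float]    # traffic class -> scheduling weight (hypothetical knob)


def aggregate(samples: Sequence[Workload]) -> Workload:
    """Sum per-class load across all per-host samples (network-wide view)."""
    classes = {c for s in samples for c in s}
    return {c: sum(s.get(c, 0.0) for s in samples) for c in classes}


def toy_predict(workload: Workload, config: Config) -> float:
    """Toy stand-in for a counterfactual predictor.

    Scores a configuration by the worst per-class load-to-weight ratio;
    lower is better. A real predictor would estimate tail latency.
    """
    return max(workload[c] / config[c] for c in workload)


def neighbors(config: Config, step: float = 0.1) -> List[Config]:
    """Candidate configs: shift `step` of weight from one class to another."""
    out = []
    for src in config:
        for dst in config:
            if src != dst and config[src] - step > 1e-9:
                cand = dict(config)
                cand[src] -= step
                cand[dst] += step
                out.append(cand)
    return out


def control_step(samples: Sequence[Workload], config: Config,
                 predict: Callable[[Workload, Config], float] = toy_predict) -> Config:
    """One iteration: aggregate, score candidates, return the best one.

    The current config is listed first, so it is kept on ties.
    """
    workload = aggregate(samples)
    candidates = [config] + neighbors(config)
    return min(candidates, key=lambda c: predict(workload, c))


def run(measure: Callable[[], Sequence[Workload]], config: Config,
        rounds: int) -> Config:
    """Closed loop: re-measure and re-optimize for a fixed number of rounds."""
    for _ in range(rounds):
        config = control_step(measure(), config)
    return config


if __name__ == "__main__":
    samples = [{"rpc": 3.0, "bulk": 1.0}, {"rpc": 3.0, "bulk": 1.0}]
    print(run(lambda: samples, {"rpc": 0.5, "bulk": 0.5}, rounds=5))
```

In this toy setting the loop moves weight toward the heavier class one step at a time and stops once no neighboring configuration scores better. Re-measuring on every round is what would let such a loop follow a workload shift, although the sketch makes no claim about convergence time.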