Cloud providers have recently decentralized their wide-area network (WAN) traffic engineering (TE) systems to contain the impact of TE controller failures. In the decentralized design, a controller fault affects only its own slice, limiting the blast radius to a fraction of the network. However, we find that autonomous slice controllers can arrive at divergent traffic allocations that overload links by 30% beyond their capacity. We present Symphony, a decentralized TE system that addresses divergence-induced congestion while preserving the fault-isolation benefits of decentralization. By augmenting TE objectives with quadratic regularization, Symphony makes traffic allocations robust to demand perturbations, so that TE controllers naturally converge to compatible allocations without coordination. In parallel, Symphony's randomized slicing algorithm partitions the network to minimize blast radius by distributing critical traffic sources across slices, preventing any single failure from becoming catastrophic. These innovations work in tandem: regularization gives traffic allocations algorithmic stability, while intelligent slicing provides architectural resilience in the network. Through extensive evaluation on cloud provider WANs, we show that Symphony reduces divergence-induced congestion by 14x and blast radius by 79% compared to current practice.
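To illustrate why a quadratic term can stabilize allocations, the following toy sketch splits one demand across parallel paths. It is not Symphony's formulation: the function names, the single-demand setting, and the greedy stand-in for an LP solver that returns an arbitrary optimal vertex are all assumptions made for illustration. Without regularization, any split that carries the demand is equally optimal, so two controllers may pick very different ones. Minimizing the sum of squared path flows has a unique solution (a water-filling split), and that solution moves only slightly when the demand estimate is perturbed.

```python
def greedy_split(demand, capacities, order):
    """Hypothetical stand-in for an unregularized solver: fill paths in a
    tie-break order. Every result carries the full demand, so all are
    equally 'optimal' for a throughput objective."""
    x = [0.0] * len(capacities)
    remaining = demand
    for i in order:
        x[i] = min(capacities[i], remaining)
        remaining -= x[i]
    return x


def regularized_split(demand, capacities):
    """Minimize sum(x_i^2) subject to sum(x_i) = demand, 0 <= x_i <= c_i.
    The unique optimum is a water-filling split: equal shares, with
    small-capacity paths saturated first."""
    if demand > sum(capacities):
        raise ValueError("demand exceeds total path capacity")
    x = [0.0] * len(capacities)
    remaining = demand
    active = len(capacities)
    # Visit paths from smallest to largest capacity.
    for i in sorted(range(len(capacities)), key=lambda j: capacities[j]):
        share = remaining / active
        x[i] = min(capacities[i], share)
        remaining -= x[i]
        active -= 1
    return x


if __name__ == "__main__":
    caps = [10.0, 10.0]
    # Two controllers with different tie-breaking diverge completely.
    print(greedy_split(10.0, caps, order=[0, 1]))  # [10.0, 0.0]
    print(greedy_split(10.0, caps, order=[1, 0]))  # [0.0, 10.0]
    # Regularized controllers with slightly different demand estimates
    # land on nearly identical allocations.
    print(regularized_split(10.0, caps))
    print(regularized_split(10.2, caps))
```

In this toy setting the two greedy controllers disagree by the full demand on each path, while the regularized allocations differ by only half of the demand perturbation per path.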