Some faults in data center networks take hours or even days to repair because they require reboots, re-imaging, or manual work by technicians. To reduce the impact on traffic in the meantime, cloud providers \textit{mitigate} the effect of these faults, for example by steering traffic to alternate paths. The state of the art in automatic network mitigation relies on simple safety checks and proxy metrics to choose mitigations. SWARM, the approach described in this paper, can pick mitigations that are orders of magnitude better by estimating end-to-end connection-level performance (CLP) metrics. At its core is a scalable CLP estimator that quickly ranks mitigations with high fidelity; on failures observed at a large cloud provider, SWARM outperforms the state of the art by over 700$\times$ in some cases.