Cloud computing and AI workloads are driving unprecedented demand for efficient communication within and across datacenters. However, the coexistence of intra- and inter-datacenter traffic within datacenters, together with the disparity between the RTTs of intra- and inter-datacenter networks, complicates congestion management and traffic routing. In particular, the faster congestion responses of intra-datacenter traffic cause rate unfairness when it competes with slower inter-datacenter flows. Additionally, inter-datacenter messages suffer from slow loss recovery and thus require reliability guarantees. Existing solutions overlook these challenges and handle inter- and intra-datacenter congestion with separate control loops or at different granularities. We propose Uno, a unified system for both inter- and intra-DC environments that integrates a transport protocol for rapid congestion reaction and fair rate control with a load balancing scheme that combines erasure coding and adaptive routing. Our findings show that Uno significantly improves the completion times of both inter- and intra-DC flows compared to state-of-the-art methods such as Gemini.