RDMA-enabled cloud services are increasingly deployed across datacenters (DCs) connected by multiple paths. These inter-DC paths exhibit path asymmetry, delayed congestion signals, and collisions among simultaneous flow routing decisions, all of which cause existing routing methods to fail. We present LCMP, a distributed, long-haul, cost-aware multi-path routing framework that places RDMA flows on multiple inter-DC paths to achieve low-cost, low-latency, and congestion-responsive transmission. LCMP combines a control-plane path-quality score with compact on-switch congestion signals: the former unifies quality assessment across asymmetric paths, and the latter enables responsive reaction to path congestion. LCMP further resolves the collision of simultaneous flow decisions by filtering out high-cost candidates and then performing a diversity-preserving hash within the reduced set. On an 8-DC testbed, LCMP reduces median and tail flow completion time (FCT) slowdown by up to 76% and 64%, respectively, compared to state-of-the-art (SOTA) DCN routing strategies. Large-scale NS-3 simulations under a 2000 km inter-DC scenario confirm similar improvements.
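The filter-then-hash selection described above can be illustrated with a minimal sketch. It is not LCMP's actual algorithm: the additive cost model, the `slack` threshold, the `congestion_weight` parameter, and every name below (`Path`, `combined_cost`, `filter_candidates`, `select_path`) are assumptions introduced purely for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    quality_cost: float  # hypothetical control-plane score; lower is better
    congestion: float    # hypothetical on-switch signal, normalized to [0, 1]


def combined_cost(path: Path, congestion_weight: float = 1.0) -> float:
    # Assumed additive model: static path cost plus weighted congestion.
    return path.quality_cost + congestion_weight * path.congestion


def filter_candidates(paths, slack: float = 0.25, congestion_weight: float = 1.0):
    # Keep only paths whose combined cost is within `slack` of the best one.
    if not paths:
        raise ValueError("at least one candidate path is required")
    costs = [combined_cost(p, congestion_weight) for p in paths]
    best = min(costs)
    return [p for p, c in zip(paths, costs) if c <= best + slack]


def flow_hash(flow_id: tuple) -> int:
    # Stable hash of the flow identifier (e.g. a 5-tuple).
    digest = hashlib.sha256(repr(flow_id).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def select_path(paths, flow_id: tuple, slack: float = 0.25,
                congestion_weight: float = 1.0) -> Path:
    # Hash inside the reduced set so that flows deciding at the same time
    # spread over all low-cost paths instead of piling onto a single one.
    candidates = sorted(filter_candidates(paths, slack, congestion_weight),
                        key=lambda p: p.name)
    return candidates[flow_hash(flow_id) % len(candidates)]


paths = [
    Path("A", quality_cost=1.0, congestion=0.1),
    Path("B", quality_cost=1.1, congestion=0.0),
    Path("C", quality_cost=1.0, congestion=0.9),  # congested, filtered out
    Path("D", quality_cost=3.0, congestion=0.0),  # high static cost, filtered out
]
flow = ("10.0.0.1", "10.1.0.2", 4791, 4791, "UDP")
chosen = select_path(paths, flow)
```

In this sketch a path that is cheap but congested (`C`) and a path that is idle but expensive (`D`) are both excluded before hashing, so the hash only spreads flows across paths of comparable cost. Picking the single cheapest path instead would send every simultaneously arriving flow to the same place, which is the collision the abstract refers to.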