As communication protocols evolve, datacenter network utilization increases. As a result, congestion is more frequent, causing higher latency and packet loss. Combined with the increasing complexity of workloads, manual design of congestion control (CC) algorithms becomes extremely difficult. This calls for the development of AI approaches to replace the human effort. Unfortunately, it is currently not possible to deploy AI models on network devices due to their limited computational capabilities. Here, we address this problem by building a computationally light CC policy based on a recent reinforcement learning CC algorithm [arXiv:2207.02295]. We reduce the inference time of RL-CC by 500× by distilling its complex neural network into decision trees. This transformation enables real-time inference within the $\mu$-sec decision-time requirement, with a negligible effect on quality. We deploy the transformed policy on NVIDIA NICs in a live cluster. Compared to popular CC algorithms used in production, RL-CC is the only method that performs well on all benchmarks tested, across a wide range of flow counts. It balances multiple metrics simultaneously: bandwidth, latency, and packet drops. These results suggest that data-driven methods for CC are feasible, challenging the prior belief that handcrafted heuristics are necessary to achieve optimal performance.
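The distillation step described above can be sketched in a few lines. The following is a minimal illustration of the general technique (fitting a shallow decision tree to imitate a trained policy network); the feature set, teacher model, and hyperparameters here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of distilling an RL policy into a decision tree.
# The teacher below is a stand-in for the trained RL-CC network; in
# practice it would map congestion signals (e.g. RTT, CNP rate) to a
# transmission-rate adjustment.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 1))  # stand-in "network" weights (assumption)

def teacher_policy(obs: np.ndarray) -> np.ndarray:
    """Stand-in policy network: 4 features -> 1 continuous action."""
    return np.tanh(obs @ W).ravel()

# 1) Collect a dataset of (observation, teacher action) pairs by
#    querying the teacher on states drawn from the environment.
observations = rng.normal(size=(10_000, 4))
actions = teacher_policy(observations)

# 2) Fit a shallow tree to imitate the teacher. Shallow depth keeps
#    inference to a handful of comparisons, which is what makes
#    microsecond-scale decisions on a NIC plausible.
student = DecisionTreeRegressor(max_depth=8)
student.fit(observations, actions)

# 3) Evaluate imitation error on held-out states.
holdout = rng.normal(size=(1_000, 4))
err = np.abs(student.predict(holdout) - teacher_policy(holdout)).mean()
```

At deployment time only the tree is needed: a depth-8 tree is at most eight feature comparisons per decision, with no floating-point matrix multiplies, which is the source of the inference-time reduction.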