As communication protocols evolve, datacenter network utilization increases. As a result, congestion becomes more frequent, causing higher latency and packet loss. Combined with the increasing complexity of workloads, this makes manual design of congestion control (CC) algorithms extremely difficult and calls for AI approaches that replace the human effort. Unfortunately, it is currently not possible to deploy AI models on network devices due to their limited computational capabilities. Here, we address this problem by building a computationally light solution based on a recent reinforcement learning CC algorithm, RL-CC [arXiv:2207.02295]. We reduce the inference time of RL-CC by a factor of 500 by distilling its complex neural network into decision trees. This transformation enables real-time inference within the microsecond-level decision-time requirement, with a negligible effect on quality. We deploy the transformed policy on NVIDIA NICs in a live cluster. Compared to popular CC algorithms used in production, RL-CC is the only method that performs well on all benchmarks tested across a wide range of flow counts. It balances multiple metrics simultaneously: bandwidth, latency, and packet drops. These results suggest that data-driven methods for CC are feasible, challenging the prior belief that handcrafted heuristics are necessary to achieve optimal performance.
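The distillation step mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: `teacher_policy` is a hypothetical stand-in for the RL-CC neural network, the two state features (`rtt_ratio`, `rate`) and their ranges are assumptions chosen for illustration, and a single small regression tree stands in for the distilled decision trees. The idea shown is only the general one: query the expensive policy on sampled states, then fit a tree that imitates its outputs.

```python
import math
import random

def teacher_policy(rtt_ratio, rate):
    """Hypothetical stand-in for the neural-network policy (not RL-CC itself).
    Maps an assumed state (RTT inflation, normalized rate) to a rate multiplier."""
    h1 = math.tanh(2.0 * (rtt_ratio - 1.0) + 0.5 * rate)
    h2 = math.tanh(-1.5 * (rtt_ratio - 1.0) + 0.3)
    return 1.0 + 0.2 * h2 - 0.1 * h1

def fit_tree(X, y, depth):
    """Greedy regression tree: each split minimizes the summed squared error.
    A leaf is a float (mean action); a node is (feature, threshold, left, right)."""
    n = len(y)
    mean = sum(y) / n
    if depth == 0 or n < 4:
        return mean
    total, total_sq = sum(y), sum(v * v for v in y)
    best = None
    for f in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][f])
        left_sum = left_sq = 0.0
        for k in range(1, n):
            v = y[order[k - 1]]
            left_sum += v
            left_sq += v * v
            lo, hi = X[order[k - 1]][f], X[order[k]][f]
            if lo == hi:
                continue  # cannot split between identical feature values
            right_sum, right_sq = total - left_sum, total_sq - left_sq
            sse = (left_sq - left_sum ** 2 / k) + (right_sq - right_sum ** 2 / (n - k))
            if best is None or sse < best[0]:
                best = (sse, f, (lo + hi) / 2.0)
    if best is None:
        return mean
    _, f, thr = best
    left = [i for i in range(n) if X[i][f] <= thr]
    right = [i for i in range(n) if X[i][f] > thr]
    if not left or not right:
        return mean
    return (f, thr,
            fit_tree([X[i] for i in left], [y[i] for i in left], depth - 1),
            fit_tree([X[i] for i in right], [y[i] for i in right], depth - 1))

def predict(tree, x):
    """Inference is a handful of comparisons: walk from the root to a leaf."""
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if x[f] <= thr else right
    return tree

def sample_states(count):
    # Assumed state ranges, for illustration only.
    return [(random.uniform(0.5, 3.0), random.uniform(0.0, 1.0)) for _ in range(count)]

random.seed(0)
train_states = sample_states(2000)
train_actions = [teacher_policy(*s) for s in train_states]  # label states with the teacher
tree = fit_tree(train_states, train_actions, depth=6)

test_states = sample_states(500)
max_err = max(abs(predict(tree, s) - teacher_policy(*s)) for s in test_states)
print(f"max |tree - teacher| on held-out states: {max_err:.4f}")
```

In this sketch the tree depth bounds the number of comparisons per decision (at most 6 here), which is the property that makes tree inference attractive on devices with limited compute. The depth and sample count are arbitrary choices, not values from the paper.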