Congestion Control (CC) plays a fundamental role in optimizing traffic in Data Center Networks (DCNs). Currently, DCNs mainly implement two CC protocols: DCTCP and DCQCN. Both protocols, along with their main variants, are based on Explicit Congestion Notification (ECN), where intermediate switches mark packets when they detect congestion. The ECN configuration is thus crucial to the performance of CC protocols. Nowadays, network experts set static ECN parameters carefully selected to optimize the average network performance. However, today's high-speed DCNs experience quick and abrupt changes that severely alter the network state (e.g., dynamic traffic workloads, incast events, failures). As a result, static ECN settings lead to under-utilization and sub-optimal performance. This paper presents GraphCC, a novel Machine Learning-based framework for in-network CC optimization. Our distributed solution relies on a novel combination of Multi-agent Reinforcement Learning (MARL) and Graph Neural Networks (GNNs), and it is compatible with widely deployed ECN-based CC protocols. GraphCC deploys distributed agents on switches that communicate with their neighbors to cooperate and optimize the global ECN configuration. In our evaluation, we test the performance of GraphCC under a wide variety of scenarios, focusing on its capability to adapt to new scenarios unseen during training (e.g., new traffic workloads, failures, upgrades). We compare GraphCC with ACC, a state-of-the-art MARL-based solution for ECN tuning, and observe that GraphCC outperforms this baseline in all evaluation scenarios, with improvements of up to $20\%$ in Flow Completion Time as well as significant reductions in buffer occupancy ($38.0$--$85.7\%$).
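To make the notion of an "ECN configuration" concrete, the following minimal sketch shows a RED-style marking rule of the kind commonly used on switches with ECN-based protocols. The abstract does not say which parameters GraphCC tunes; the three thresholds used here (`k_min`, `k_max`, `p_max`) and the function names are assumptions chosen for illustration only.

```python
import random


def ecn_mark_probability(queue_len, k_min, k_max, p_max):
    """Probability of ECN-marking a packet given the current queue length.

    Assumed RED-style rule (illustrative, not taken from the paper):
      - below k_min: never mark
      - at or above k_max: always mark
      - in between: probability rises linearly from 0 to p_max
    """
    if queue_len <= k_min:
        return 0.0
    if queue_len >= k_max:
        return 1.0
    return p_max * (queue_len - k_min) / (k_max - k_min)


def should_mark(queue_len, k_min, k_max, p_max, rng=random):
    """Draw a random marking decision for a single packet."""
    return rng.random() < ecn_mark_probability(queue_len, k_min, k_max, p_max)


if __name__ == "__main__":
    # Hypothetical static configuration (queue lengths in KB).
    static_cfg = {"k_min": 100, "k_max": 400, "p_max": 0.2}
    for q in (50, 250, 500):
        print(q, ecn_mark_probability(q, **static_cfg))
```

With a static configuration, these values stay fixed regardless of the traffic pattern. A tuning scheme in the spirit described above would instead adjust them per switch as the network state changes.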