Congestion control plays a pivotal role in large-scale data centers, enabling ultra-low latency, high bandwidth, and high link utilization. Even with data center congestion control mechanisms such as DCQCN and HPCC deployed, these algorithms often respond to congestion sluggishly. The sluggishness is primarily due to slow congestion notification: it takes almost one round-trip time (RTT) for congestion information to reach the sender. In this paper, we introduce the Fast Notification Congestion Control (FNCC) mechanism, which achieves sub-RTT notification. FNCC uses acknowledgment packets (ACKs) on the return path to carry in-network telemetry (INT) information for the request path, giving the sender more timely and accurate INT. To further accelerate the response to last-hop congestion, we propose that the receiver notify the sender of the number of concurrent congested flows, which the sender can use to quickly adjust the congested flows to a fair rate. Our experimental results demonstrate that FNCC reduces flow completion time by 27.4% and 88.9% compared to HPCC and DCQCN, respectively. Moreover, FNCC triggers minimal pause frames and maintains high utilization even at 400 Gbps.
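The two ideas in the abstract can be illustrated with a toy model. The sketch below is not the FNCC implementation: the function names, the `LastHopNotice` field, the symmetric-path assumption, and the equal-share rate rule are all assumptions made here for illustration. It compares how long telemetry from a congested switch takes to reach the sender when it must travel forward to the receiver first versus when a returning ACK picks it up directly, and it computes an equal-share rate from a receiver-reported flow count.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LastHopNotice:
    """Hypothetical ACK field: number of flows the receiver sees competing on its last hop."""
    concurrent_flows: int


def fair_rate_gbps(line_rate_gbps: float, notice: LastHopNotice) -> float:
    """Equal share of the last-hop link (an assumed policy, not the paper's exact rule)."""
    # Clamp the count to at least 1 so a bad report never causes a division by zero.
    n = max(1, notice.concurrent_flows)
    return line_rate_gbps / n


def notification_delay_us(link_delays_us: Sequence[float],
                          congested_switch: int,
                          via_return_path: bool) -> float:
    """Time for telemetry at a congested switch to reach the sender (toy model).

    link_delays_us[i] is the one-way delay of link i, counted from the sender.
    congested_switch is the number of links between the sender and that switch.
    The forward and return paths are assumed to be symmetric.
    """
    sender_to_switch = sum(link_delays_us[:congested_switch])
    switch_to_receiver = sum(link_delays_us[congested_switch:])
    if via_return_path:
        # A returning ACK picks up the telemetry at the switch and carries it
        # only over the remaining links back to the sender.
        return sender_to_switch
    # Otherwise a data packet carries it to the receiver, and an ACK then
    # brings it back across the whole path.
    return switch_to_receiver + sender_to_switch + switch_to_receiver


if __name__ == "__main__":
    delays = [1.0, 1.0, 1.0]  # three links, 1 us each, so the RTT is 6 us
    print(notification_delay_us(delays, 1, via_return_path=False))
    print(notification_delay_us(delays, 1, via_return_path=True))
    print(fair_rate_gbps(400.0, LastHopNotice(concurrent_flows=4)))
```

In this model, the closer the congested switch is to the sender, the larger the gap between the two notification delays, which is one way to read the "almost one RTT" versus "sub-RTT" contrast above.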