Congestion control plays a pivotal role in large-scale data centers, enabling ultra-low latency, high bandwidth, and optimal utilization. Even with data center congestion control mechanisms such as DCQCN and HPCC deployed, these algorithms often respond to congestion sluggishly, primarily because congestion notification is slow: it takes almost one round-trip time (RTT) for congestion information to reach the sender. In this paper, we introduce the Fast Notification Congestion Control (FNCC) mechanism, which achieves sub-RTT notification. FNCC uses acknowledgment packets (ACKs) on the return path to carry in-network telemetry (INT) information about the request path, giving the sender more timely and accurate INT. To further accelerate the response to last-hop congestion, we propose that the receiver notify the sender of the number of concurrent congested flows, which the sender can use to quickly adjust those flows to a fair rate. Our experimental results demonstrate that FNCC reduces flow completion time by 27.4% and 88.9% compared to HPCC and DCQCN, respectively. Moreover, FNCC triggers minimal pause frames and maintains high utilization even at 400 Gbps.
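
To make the last-hop idea concrete, the following is a minimal sketch of how a sender might use the receiver-reported count of concurrent congested flows. The abstract does not give the actual update rule, so the equal split of the line rate, the function names `fair_share_rate` and `on_ack`, and all parameters are assumptions made purely for illustration.

```python
def fair_share_rate(line_rate_gbps: float, congested_flows: int) -> float:
    """Per-flow rate if the congested flows split the last-hop link equally.

    Assumption: "fair rate" means line rate divided by the flow count.
    """
    if congested_flows < 1:
        raise ValueError("congested_flows must be at least 1")
    return line_rate_gbps / congested_flows


def on_ack(current_rate_gbps: float, line_rate_gbps: float,
           congested_flows: int) -> float:
    """Hypothetical sender reaction to an ACK carrying the flow count.

    The sender never exceeds its fair share, and it keeps its current
    rate when that rate is already below the fair share.
    """
    return min(current_rate_gbps,
               fair_share_rate(line_rate_gbps, congested_flows))


if __name__ == "__main__":
    # Four flows converge on one 400 Gbps last-hop link.
    print(on_ack(current_rate_gbps=400.0, line_rate_gbps=400.0,
                 congested_flows=4))
```

Under these assumptions, a sender can jump to the fair rate as soon as a single ACK reports the flow count, rather than converging to it over several RTTs.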