Practitioners in a growing number of scientific and artificial-intelligence (AI) applications use High-Performance Wide-Area Networks (HP-WANs) to move massive data sets between remote facilities. Accurate prediction of the flow completion time (FCT) is essential in these data-transfer workflows because compute and storage resources are tightly scheduled and expensive. We assess the viability of three TCP congestion control algorithms (CUBIC, BBRv1, and BBRv3) for massive data transfers over public HP-WANs, where limited control of critical data-path parameters precludes the use of Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCEv2), which is known to outperform TCP in private HP-WANs. Extensive experiments on the FABRIC testbed indicate that these configuration-control limitations can also hinder TCP, especially through microburst-induced packet losses. Under these challenging conditions, we show that the highest FCT predictability is achieved by combining BBRv1 with traffic shaping applied before the HP-WAN entry points.