Practitioners in a growing number of scientific and artificial-intelligence (AI) applications use High-Performance Wide-Area Networks (HP-WANs) to move massive data sets between remote facilities. Accurate prediction of the flow completion time (FCT) is essential in these data-transfer workflows because compute and storage resources are tightly scheduled and expensive. We assess the viability of three TCP congestion control algorithms (CUBIC, BBRv1, and BBRv3) for massive data transfers over public HP-WANs. In these networks, limited control over critical data-path parameters precludes the use of Remote Direct Memory Access (RDMA) over Converged Ethernet version 2 (RoCEv2), which is known to outperform TCP in private HP-WANs. Extensive experiments on the FABRIC testbed indicate that the same configuration-control limitations can also hinder TCP, especially through packet losses induced by microbursts. Under these challenging conditions, we show that the highest FCT predictability is achieved by combining BBRv1 with traffic shaping applied before the HP-WAN entry points.