Practitioners of a growing number of scientific and artificial-intelligence (AI) applications use High-Performance Wide-Area Networks (HP-WANs) for moving massive data sets between remote facilities. Accurate prediction of the flow completion time (FCT) is essential in these data-transfer workflows because compute and storage resources are tightly scheduled and expensive. We assess the viability of three TCP congestion control algorithms (CUBIC, BBRv1, and BBRv3) for massive data transfers over public HP-WANs, where limited control of critical data-path parameters precludes the use of Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCEv2), which is known to outperform TCP in private HP-WANs. Extensive experiments on the FABRIC testbed indicate that these configuration-control limitations can also hinder TCP, especially through microburst-induced packet losses. Under these challenging conditions, we show that the highest FCT predictability is achieved by combining BBRv1 with traffic shaping applied before the HP-WAN entry points.