The pursuit of high-performance data transfer often focuses on raw network bandwidth, and international links of 100 Gbps or higher are frequently regarded as the primary enabler. While necessary, this network-centric view is incomplete: it equates provisioned link speed with practical, sustainable data movement capability. Lower-than-desired data rates are commonly observed even on 10 Gbps links, and higher-speed networks only make the shortfall more visible. We investigate six paradigms, ranging from network latency and TCP congestion control to host-side factors such as CPU performance and virtualization, that critically impact data movement workflows. These paradigms represent widely accepted engineering assumptions that inform system design, procurement decisions, and operational practices in production data movement environments. To address the fidelity gap between raw bandwidth and application-level throughput, we introduce the Drainage Basin Pattern, a conceptual model for reasoning about end-to-end data flow constraints across heterogeneous hardware and software components at varying desired data rates. Our findings are validated through rigorous production-scale deployments, from 10 Gbps links to U.S. DOE ESnet technical evaluations and transcontinental production trials over 100 Gbps operational links. The results demonstrate that the principal bottlenecks often reside outside the network core, and that holistic hardware-software co-design enables consistent, predictable performance for demanding data transports, both bulk and streaming. The key goal is to transform a demanding data transfer from a struggle with unknown outcomes into a predictable, routine operation at guaranteed line rate that anyone can perform. A further goal is to correct the common misconception that conflates complexity with expertise.