Over the past decade, specialized computing and storage devices, such as GPUs, TPUs, and high-speed storage, have been increasingly integrated into server nodes within supercomputers and data centers. The advent of high-bandwidth memory (HBM) has enabled a more compact design for these components, allowing multiple units to be interconnected within a single server node through intra-node networks such as PCIe, NVLink, or Ethernet. These networks make it possible to scale up the number of dedicated computing and storage devices per node. Additionally, inter-node networks link these devices across thousands of server nodes in large-scale computing systems. However, as communication demands among accelerators grow, especially in workloads such as generative AI, both intra- and inter-node networks risk becoming critical bottlenecks. Although modern intra-node network architectures attempt to mitigate this issue by boosting bandwidth, we demonstrate in this paper that such an approach can inadvertently degrade inter-node communication: high-bandwidth intra-node traffic interferes with incoming traffic from external nodes, causing congestion at shared resources. To evaluate this phenomenon, we analyze the communication behavior of realistic traffic patterns commonly found in generative AI applications. Using OMNeT++, we develop a general simulation model that captures both intra- and inter-node network interactions. Through extensive simulations, our findings reveal that increasing intra-node bandwidth and the number of accelerators per node can actually hinder overall inter-node communication performance rather than improve it.
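The interference effect described above can be conveyed with a simple back-of-envelope contention model (an illustrative sketch only, not the paper's OMNeT++ simulation model; the capacity and load figures below are hypothetical): when intra-node and inter-node flows converge on a shared ingress link at a destination accelerator, raising intra-node bandwidth lets local flows claim a larger share of that link, shrinking what incoming inter-node traffic receives.

```python
# Illustrative contention sketch (assumption: intra- and inter-node flows
# share one ingress link at the destination accelerator, and congestion
# splits the link proportionally to each flow's offered load).

def inter_node_throughput(intra_offered, inter_offered, link_capacity):
    """Effective inter-node throughput (GB/s) under proportional sharing."""
    total = intra_offered + inter_offered
    if total <= link_capacity:
        return inter_offered  # no congestion: demand is fully served
    return link_capacity * inter_offered / total  # proportional share

# As intra-node offered load grows, the inter-node flow's share shrinks,
# even though the inter-node link itself is unchanged.
for intra in (50, 100, 200, 400):  # hypothetical intra-node load, GB/s
    t = inter_node_throughput(intra, inter_offered=100, link_capacity=200)
    print(f"intra={intra:4d} GB/s -> inter-node throughput {t:6.2f} GB/s")
```

Under this toy model, doubling intra-node load from 200 to 400 GB/s cuts the inter-node flow's effective throughput from roughly 67 to 40 GB/s, which is the qualitative trend the simulations in the paper examine in detail.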