High-performance computing (HPC) systems increasingly support both scalable AI training and large-scale simulation workloads, and both rely heavily on collective communication operations. On modern supercomputers, however, network congestion has emerged as a major limitation, driven by the heterogeneous traffic patterns that result from diverse workload mixes. As system scale and the number of active users continue to grow, understanding how today's interconnect technologies respond to congestion is essential for establishing realistic performance expectations and informing future system design. This paper presents a comprehensive characterization of congestion behavior across four major HPC fabrics (EDR InfiniBand, HDR InfiniBand, NDR InfiniBand, and Cray Slingshot) as well as emerging Ethernet fabrics. Together, these fabrics span high-performance proprietary interconnects and adaptive Ethernet-based designs aligned with emerging standards such as Ultra Ethernet. We evaluate their responses to steady congestion and to a wide range of bursty patterns that vary in duration, intensity, and pause length, reflecting the intermittent communication typical of AI workloads. Our study covers multiple scales, examining how congestion manifests differently as system size increases and identifying scale-dependent behaviors that influence collective performance. By analyzing the challenges that arise under these controlled stress conditions, we aim to provide a practical overview of congestion issues and possible optimizations. The insights from this evaluation can guide researchers and HPC architects in designing more effective congestion-control mechanisms and network load-balancing strategies.