The interconnection network is a key component of supercomputers and data centers, and its design must cope with the increasing communication demands of current applications and services; otherwise, it may become a system bottleneck. The most challenging network design issues are the topology, routing algorithm, flow control, and power efficiency. However, even the most efficient interconnection networks may suffer severe performance degradation due to congestion, especially under specific network traffic patterns generated by communication operations in high-performance computing (HPC), deep learning training, or online data-intensive services. In this context, characterizing and modeling these communication operations and the network traffic patterns they generate is a fundamental challenge for studying their impact on network performance. This paper presents a methodology, based primarily on the VEF traces framework, to characterize, model, and simulate the communication patterns of representative computing- and data-intensive applications. More precisely, we have extended the VEF traces framework with tools that characterize network congestion, either directly from VEF traces or via simulation. We have analyzed a set of VEF traces obtained from runs of NEST, GROMACS, LAMMPS, and PATMOS on several supercomputers. From this analysis, we identify potential congestion scenarios that arise in realistic network configurations when certain collective operations are performed.