The Dragonfly interconnect is a crucial network technology for supercomputers. To support exascale systems, its network resources are shared, so that links and routers are not dedicated to any node pair. While this sharing increases link utilization, workload performance is often degraded by network contention. Recently, intelligent routing built on reinforcement learning has demonstrated higher network throughput with lower packet latency. However, its effectiveness in reducing workload interference is unknown. In this work, we present extensive network simulations to study multi-workload contention under two routing mechanisms, intelligent routing and adaptive routing, on a large-scale Dragonfly system. We develop an enhanced network simulation toolkit, along with a suite of workloads with distinctive communication patterns. We also present two metrics to characterize application communication intensity. Our analysis examines how different workloads interfere with each other under each routing mechanism by inspecting both application-level and network-level metrics, and it yields several key insights.