With the rapid growth of machine learning (ML) applications, the workloads of future high-performance computing (HPC) systems are anticipated to be a mix of scientific simulation, big data analytics, and machine learning applications. Simulation is a valuable research vehicle for understanding the performance implications of co-running scientific applications with big data and machine learning workloads on large-scale systems. In this paper, we present Union, a workload manager that provides an automatic framework to facilitate hybrid workload simulation in CODES. Furthermore, we use Union, along with CODES, to investigate various hybrid workloads composed of traditional simulation applications and emerging learning applications on two dragonfly systems. The experimental results show that both message latency and communication time are important performance metrics for evaluating network interference. Network interference on HPC applications is reflected more in message latency variation, whereas ML application performance depends more on communication time.