With the rapid increase in machine learning workloads on HPC systems, it is beneficial to run machine-learning-specific benchmarks regularly to monitor performance and identify issues. Furthermore, as part of the Edinburgh International Data Facility, EPCC currently hosts a wide range of machine learning accelerators, including Nvidia GPUs, the Graphcore Bow Pod64 and the Cerebras CS-2, which are managed via Kubernetes and Slurm. We extended the ReFrame framework to support Kubernetes as a scheduler backend and use it to run machine learning benchmarks. We discuss the preliminary results collected and the challenges involved in integrating ReFrame across multiple platforms and architectures.