GROMACS on AMD GPU-Based HPC Platforms: Using SYCL for Performance and Portability

GROMACS is a widely used molecular dynamics software package with a focus on performance, portability, and maintainability across a broad range of platforms. Thanks to its early algorithmic redesign and flexible heterogeneous parallelization, GROMACS has successfully harnessed GPU accelerators for more than a decade. With the diversification of accelerator platforms in HPC and no obvious choice for a multi-vendor programming model, the GROMACS project found itself at a crossroads. The performance and portability requirements, together with a strong preference for a standards-based solution, motivated our choice of SYCL for both new HPC GPU platforms, AMD and Intel. Since the GROMACS 2022 release, the SYCL backend has been the primary means of targeting AMD GPUs in preparation for exascale HPC architectures like LUMI and Frontier. SYCL is a cross-platform, royalty-free, C++17-based standard for programming hardware accelerators. It allows the same code to target GPUs from all three major vendors with minimal specialization. Although SYCL implementations build on native toolchains, the performance of such an approach is not self-evident. Biomolecular simulations have demanding performance characteristics: latency sensitivity, the need for strong scaling, and typical iteration times as short as hundreds of microseconds. Hence, obtaining good performance across the range of problem sizes and scaling regimes is particularly challenging. Here, we share the results of our work on readying GROMACS for AMD GPU platforms using SYCL, and demonstrate performance on Cray EX235a machines with MI250X accelerators. Our findings illustrate that portability is possible without major performance compromises. We provide a detailed analysis of node-level kernel and runtime performance, with the aim of sharing best practices with the HPC community on using SYCL as a performance-portable GPU framework.
