We explore the performance and portability of three high-level programming models, the LLVM-based Julia and Python/Numba, and Kokkos, on high-performance computing (HPC) nodes: AMD Epyc CPUs and MI250X graphics processing units (GPUs) on Crusher, Frontier's test bed system, and Ampere's Arm-based CPUs and NVIDIA A100 GPUs on the Wombat system at the Oak Ridge Leadership Computing Facility. We compare the default performance of a hand-rolled dense matrix multiplication algorithm against vendor-compiled C/OpenMP implementations on CPUs, and against CUDA and HIP on the respective GPUs. Rather than focusing on kernel optimization per se, we select this naive approach because it resembles exploratory work in science and serves as a lower bound for performance, which isolates the effect of each programming model. Julia and Kokkos perform comparably with C/OpenMP on CPUs, while the Julia implementations are competitive with CUDA and HIP on GPUs. We identify performance gaps on NVIDIA A100 GPUs for Julia in single precision and for Kokkos, and for Python/Numba in all scenarios. We also comment on half-precision support, productivity, performance portability metrics, and platform readiness. We expect this work to contribute to the understanding and direction of high-level, high-productivity languages in HPC as the first-generation exascale systems are deployed.
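To make the "hand-rolled" baseline concrete, the following is a minimal sketch of a naive dense matrix multiplication in plain Python. It is not the kernel used in the study: the function name `naive_matmul`, the nested-list storage, and the `i, l, j` loop order are assumptions chosen for illustration, and no Numba, threading, or GPU offload is applied.

```python
def naive_matmul(a, b):
    """Triple-loop dense product C = A * B on row-major nested lists.

    Illustrative only: names and loop order are assumptions, not the
    study's implementation.
    """
    n, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions do not match")
    m = len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):            # rows of A and C
        for l in range(k):        # shared inner dimension
            a_il = a[i][l]
            for j in range(m):    # columns of B and C
                c[i][j] += a_il * b[l][j]
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(naive_matmul(a, b))  # [[19.0, 22.0], [43.0, 50.0]]
```

A kernel of this shape performs no blocking, vectorization hints, or use of vendor libraries, which is why it is plausible as a lower bound: any difference between programming models then comes from their compilers and runtimes rather than from hand tuning.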