GPUs are at the heart of the latest generations of supercomputers. We efficiently accelerate a compressible multiphase flow solver via OpenACC on NVIDIA and AMD Instinct GPUs. Optimization is accomplished by specifying the directive clauses 'gang vector' and 'collapse'. Further speedups of six and ten times are achieved by packing user-defined types into coalesced multidimensional arrays and by manual inlining via metaprogramming. Additional optimizations yield a seven-times speedup in array packing and a thirty-times speedup of select kernels on Frontier. Weak scaling efficiencies of 97% and 95% are observed when scaling to 50% of Summit and 95% of Frontier, respectively. Strong scaling efficiencies of 84% and 81% are observed when increasing the device count by factors of 8 and 16 on V100 and MI250X hardware, respectively. With GPU-aware MPI for communication, the strong scaling efficiency on AMD's MI250X increases to 92% for the same 16-fold increase in device count.
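To make the reported scaling figures concrete, the sketch below computes weak and strong scaling efficiency using the conventional definitions. The abstract does not state which definitions were used, so these formulas are an assumption. The function names and timings are hypothetical, chosen only to illustrate the arithmetic, and are not measurements from Summit or Frontier.

```python
def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Weak scaling: the problem size grows with the device count,
    so the ideal run time stays constant and efficiency is t_base / t_scaled."""
    return t_base / t_scaled


def strong_scaling_efficiency(t_base: float, t_scaled: float,
                              device_factor: float) -> float:
    """Strong scaling: the problem size is fixed, so the ideal run time
    falls by device_factor (the ratio of scaled to base device counts)."""
    return t_base / (device_factor * t_scaled)


# Hypothetical wall-clock times in seconds (illustrative only).
weak = weak_scaling_efficiency(t_base=10.0, t_scaled=10.5)
strong = strong_scaling_efficiency(t_base=80.0, t_scaled=12.5, device_factor=8)

print(f"weak scaling efficiency:   {weak:.1%}")
print(f"strong scaling efficiency: {strong:.1%}")
```

Under these definitions, a strong scaling efficiency of 84% for an 8-fold increase in device count would correspond to an observed speedup of roughly 0.84 × 8 ≈ 6.7 over the base configuration.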