A current trend in HPC systems is the use of architectures with SIMD or vector extensions to exploit data parallelism. There are several ways to take advantage of such modern vector architectures, such as intrinsics, guided vectorization via pragmas, or compiler autovectorization, each with a different impact on the code and its portability. Our objectives are to maximize vectorization efficiency and minimize code specialization. To achieve them, we rely on compiler autovectorization and leverage a set of hardware and software tools that allow us to analyze in detail where autovectorization is suboptimal. We then apply an iterative methodology that incrementally improves how efficiently the code uses the underlying hardware. In this paper, we apply this methodology to a CFD production code. We evaluate the performance on an innovative configurable platform powered by a RISC-V core coupled with a wide vector unit capable of operating on up to 256 double-precision elements. Following the vectorization process, we demonstrate a single-core speedup of 7.6$\times$ compared to the scalar implementation. Furthermore, we show that code portability is not compromised: our solution continues to exhibit performance benefits, or at the very least no drawbacks, on other HPC architectures such as Intel x86 and NEC SX-Aurora.