The growing hardware diversity in high-performance computing (HPC) adds enormous complexity to scientific software development. Developers who aim to write maintainable software have two options: 1) Use a so-called data locality abstraction that handles portability internally, so that performance and productivity become a trade-off. Such abstractions usually take the form of libraries, domain-specific languages, and run-time systems. 2) Use generic programming, where performance, productivity, and portability are determined by the software design. Following the second option, this work describes a design approach, based on template metaprogramming in C++, that allows low-level and verbose programming tools to be integrated into high-level generic algorithms. This enables the development of performance-portable applications targeting host-device computer architectures, such as CPUs and GPUs. With a suitable design in place, extending generic algorithms to new hardware becomes a well-defined procedure that can be carried out in isolation from other parts of the code. This allows scientific software to remain maintainable and efficient in a period of diversifying HPC hardware. As a proof of concept, a finite-difference modelling algorithm for the acoustic wave equation is developed and benchmarked using roofline model analysis on an Intel Xeon Gold 6248 CPU, an Nvidia Tesla V100 GPU, and an AMD MI100 GPU.
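To make the generic-programming option more concrete, the following is a minimal C++ sketch of how a hardware-specific loop construct might be hidden behind a template parameter while the numerical algorithm stays generic. It is an illustration under stated assumptions, not the design used in this work: the names `SerialHost`, `OpenMPHost`, `Executor`, and `wave_step` are hypothetical, only host backends are shown, and a 1D second-order stencil is chosen purely for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical backend tags (illustrative names, not taken from the paper).
struct SerialHost {};
struct OpenMPHost {};

// Primary template is declared but not defined, so requesting an
// unsupported backend is a compile-time error.
template <typename Backend>
struct Executor;

// Serial host backend: a plain loop.
template <>
struct Executor<SerialHost> {
    template <typename Kernel>
    static void parallel_for(std::size_t begin, std::size_t end, Kernel kernel) {
        for (std::size_t i = begin; i < end; ++i) kernel(i);
    }
};

// OpenMP host backend: the verbose, tool-specific part lives only here.
// Without OpenMP enabled, the pragma is ignored and the loop runs serially.
template <>
struct Executor<OpenMPHost> {
    template <typename Kernel>
    static void parallel_for(std::size_t begin, std::size_t end, Kernel kernel) {
        const long long lo = static_cast<long long>(begin);
        const long long hi = static_cast<long long>(end);
        #pragma omp parallel for
        for (long long i = lo; i < hi; ++i) kernel(static_cast<std::size_t>(i));
    }
};

// Generic algorithm: one leapfrog time step of the 1D acoustic wave
// equation u_tt = c^2 u_xx, second order in time and space.
// courant2 is assumed to hold (c * dt / dx)^2. Boundary values of u_next
// are left untouched.
template <typename Backend>
void wave_step(const std::vector<double>& u_prev,
               const std::vector<double>& u_curr,
               std::vector<double>& u_next,
               double courant2) {
    assert(u_prev.size() == u_curr.size() && u_curr.size() == u_next.size());
    const std::size_t n = u_curr.size();
    if (n < 3) return;  // no interior points to update

    const double* prev = u_prev.data();
    const double* curr = u_curr.data();
    double* next = u_next.data();

    // The algorithm only states what happens at grid point i; how the
    // index range is traversed is decided by the backend specialization.
    Executor<Backend>::parallel_for(1, n - 1, [=](std::size_t i) {
        next[i] = 2.0 * curr[i] - prev[i]
                + courant2 * (curr[i + 1] - 2.0 * curr[i] + curr[i - 1]);
    });
}
```

A caller would select the backend at compile time, for example `wave_step<OpenMPHost>(u_prev, u_curr, u_next, courant2);`. In a design of this kind, supporting a new device would presumably mean adding one more `Executor` specialization (plus device memory handling, which this sketch omits) without touching `wave_step`.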