Hardware heterogeneity is here to stay in high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware. This shift in the computing ecosystem offers many opportunities for performance improvement, but it also makes such architectures more complex to program. This work introduces a runtime framework that simplifies programming for heterogeneous systems while using hardware resources efficiently. The framework is integrated into a distributed, scalable runtime system to facilitate performance portability across heterogeneous nodes. Along with the design, this paper describes the implementation and the optimizations performed, which achieve up to 300% improvement on a single device and linear scalability on a node equipped with four GPUs. In a distributed-memory environment, the framework offers portable abstractions that enable efficient inter-node communication among devices with varying capabilities. It outperforms MPI+CUDA by up to 20% for large messages while keeping the overhead for small messages within 10%. Furthermore, our performance evaluation with a distributed Jacobi proxy application demonstrates that our software imposes minimal overhead and achieves a performance improvement of up to 40%. These gains come from optimizations at the library level and from creating opportunities to leverage application-specific optimizations such as over-decomposition.