Hardware heterogeneity is here to stay in high-performance computing. Large-scale systems already ship with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware. This shift offers many opportunities for performance improvement, but it also makes such architectures harder to program. This work introduces a runtime framework that simplifies programming for heterogeneous systems while using hardware resources efficiently. The framework is integrated into a distributed, scalable runtime system to facilitate performance portability across heterogeneous nodes. Along with the design, this paper describes the implementation and the optimizations performed, which achieve up to 300% improvement on a single device and linear scalability on a node equipped with four GPUs. In a distributed-memory environment, the framework offers portable abstractions that enable efficient inter-node communication among devices with varying capabilities. It outperforms MPI+CUDA by up to 20% for large messages while keeping overheads for small messages within 10%. Furthermore, our performance evaluation of a distributed Jacobi proxy application shows that our software imposes minimal overhead and achieves a performance improvement of up to 40%. These gains come from the optimizations described above, together with over-decomposition implemented in a way that ensures performance portability.