Leveraging Graphics Processing Units (GPUs) to accelerate scientific software has proven highly successful, but to extract more performance, GPU programmers must overcome the high latency costs associated with their use. One method of reducing or hiding this latency is to issue commands to the GPU through asynchronous streams. While performant, the streams model is an invasive abstraction and has therefore proven difficult to integrate into general-purpose libraries. In this work, we enumerate the difficulties that library authors face in adopting streams and present recent work on addressing them. Finally, we present a unified asynchronous programming model for use in the Portable, Extensible Toolkit for Scientific Computation (PETSc) to overcome these challenges. The new model shows broad performance benefits while remaining ergonomic for the user.