Graphics Processing Units (GPUs) have become ubiquitous in scientific computing. However, writing efficient GPU kernels can be challenging due to the need for careful code tuning. To automatically explore the kernel optimization space, several auto-tuning tools, such as Kernel Tuner, have been proposed. Unfortunately, these existing auto-tuning tools often do not address the integration of tuning results back into applications, which puts a significant implementation and maintenance burden on application developers. In this work, we present Kernel Launcher: an easy-to-use C++ library that simplifies the creation of highly tuned CUDA applications. With Kernel Launcher, programmers can capture kernel launches, tune the captured kernels for different setups, and integrate the tuning results back into applications using runtime compilation. To showcase the applicability of Kernel Launcher, we consider a real-world computational fluid dynamics code and tune its kernels for different GPUs, input domains, and precisions.