Optimizing GPU kernels manually is challenging and time-consuming. With the rapid development of LLMs, automated GPU kernel optimization is gradually becoming practical. However, current LLM-driven optimization methods focus narrowly on machine learning applications, such as PyTorch operator optimization, and overlook broader domains such as sparse matrix operations in scientific computing. Extending to these domains poses new challenges for both benchmark design and optimization algorithms, so we focus on developing a general-purpose automated kernel optimization method. In this paper, we address the lack of systematic evaluation in multi-scenario settings by introducing MSKernelBench, a benchmark that spans fundamental algebraic operations, common LLM kernels, sparse matrix operators, and scientific computing routines, each supporting both FP32 and BF16 precision. Building on this benchmark, we introduce CUDAMaster, a multi-agent, hardware-aware system for kernel optimization that leverages profiling information and automatically constructs the full compilation and execution toolchain. Experimental results show that CUDAMaster achieves significant speedups on most operators, outperforming Astra by about 35%. In several cases, its performance matches or surpasses that of highly optimized, closed-source libraries such as cuBLAS. A demo showcasing the original and optimized code for each operator is available at https://hanyx2021.github.io/MSKernelBenchDemo/.