Large language models (LLMs) are characterized by their substantial computational requirements. To reduce this cost, researchers develop specialized CUDA kernels, which often fuse several tensor operations to maximize GPU utilization. However, these specialized kernels may still leave performance on the table: CUDA assembly experts have shown that manually optimizing GPU SASS schedules can yield better performance, and finding the best SASS schedules largely relies on manual trial and error. In this work, we take an automatic approach to optimizing GPU SASS schedules, which can therefore be integrated into existing compiler frameworks. The key to automatic optimization is training an RL agent to mimic how human experts perform manual scheduling. To this end, we formulate an assembly game in which RL agents play to find the best GPU SASS schedules. The game starts from an \textit{-O3} optimized SASS schedule, and the agents iteratively apply actions that mutate the current schedule. A positive reward is generated whenever a mutated schedule achieves higher throughput when executed on GPUs. Experiments show that CuAsmRL transparently improves the performance of existing specialized CUDA kernels by up to $26\%$, and by $9\%$ on average. Moreover, it serves as a tool to reveal optimization moves learned automatically.
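The assembly-game loop described above (start from an \textit{-O3} schedule, mutate it with actions, reward throughput gains) can be sketched as follows. This is a minimal illustrative sketch, not the CuAsmRL implementation: the schedule is modeled as an ordered list of instruction ids, actions swap adjacent instructions, and \texttt{measure\_throughput} is a hypothetical stand-in for executing the kernel on a GPU and timing it.

```python
# Illustrative sketch of the "assembly game" (NOT the actual CuAsmRL code).
# A schedule is an ordered list of instruction ids; an action swaps two
# adjacent instructions, mimicking the mutation moves an RL agent applies
# to a SASS schedule.

def measure_throughput(schedule):
    """Hypothetical stand-in for running the kernel and timing it.

    Toy objective: schedules closer to sorted order score higher.
    In CuAsmRL this signal would come from real GPU execution.
    """
    return -sum(abs(i - x) for i, x in enumerate(schedule))

class AssemblyGame:
    def __init__(self, o3_schedule):
        # The game starts from the -O3 optimized schedule.
        self.schedule = list(o3_schedule)
        self.best = measure_throughput(self.schedule)

    def step(self, action):
        """Mutate the schedule by swapping positions `action` and
        `action + 1`; reward the agent only when throughput improves."""
        i = action % (len(self.schedule) - 1)
        self.schedule[i], self.schedule[i + 1] = (
            self.schedule[i + 1], self.schedule[i])
        t = measure_throughput(self.schedule)
        reward = max(0.0, t - self.best)  # positive reward on improvement
        self.best = max(self.best, t)
        return self.schedule, reward

game = AssemblyGame([2, 0, 1, 3])
schedule, reward = game.step(0)  # swap the first two instructions
```

In the real setting the action space would also include non-adjacent reorderings constrained by data dependences, and the reward would be measured kernel throughput rather than a toy score.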