Large language models (LLMs) have become a significant workload since their appearance. However, they are also computationally expensive, as they have billions of parameters and are trained on massive amounts of data. Recent work has therefore developed dedicated CUDA kernels for LLM training and inference instead of relying on compiler-generated ones, so that hardware resources are utilized as fully as possible. In this work, we explore optimization at the level of GPU native instructions to push the performance of these CUDA kernels further. In contrast to prior work, we adopt an automatic approach: we define a search space of possible GPU native instruction schedules and then apply stochastic search to optimize over it. Experiments show that SIP further improves CUDA kernel throughput by automatically discovering better GPU native instruction schedules, and the optimized schedules are tested with 10 million test samples.
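To make the idea of stochastic search over a space of instruction schedules concrete, the following is a minimal Python sketch on a toy problem. It is not SIP's actual search space, mutation policy, or cost function, none of which are specified in this abstract. The instruction names, latencies, the in-order cost model, and the simulated-annealing-style acceptance rule are all assumptions chosen for illustration; a real system would presumably measure kernel runtime on a GPU rather than use a simulated cost.

```python
import math
import random

# Hypothetical toy instructions: (name, latency_in_cycles, dependencies).
# Illustrative placeholders only, not real GPU native instructions.
INSTRS = [
    ("load_a", 4, []),
    ("load_b", 4, []),
    ("mul", 2, ["load_a", "load_b"]),
    ("load_c", 4, []),
    ("add", 1, ["mul", "load_c"]),
    ("store", 1, ["add"]),
]
LATENCY = {name: lat for name, lat, _ in INSTRS}
DEPS = {name: set(deps) for name, _, deps in INSTRS}


def cost(schedule):
    """Assumed toy model: one issue per cycle, in order; an instruction
    stalls until all of its inputs are ready. Returns the finish cycle."""
    ready = {}  # name -> cycle at which its result becomes available
    cycle = 0
    for name in schedule:
        start = max([cycle] + [ready[d] for d in DEPS[name]])
        ready[name] = start + LATENCY[name]
        cycle = start + 1
    return max(ready.values())


def is_valid(schedule):
    """A schedule is valid if every instruction follows its dependencies."""
    seen = set()
    for name in schedule:
        if not DEPS[name] <= seen:
            return False
        seen.add(name)
    return True


def neighbor(schedule, rng):
    """Swap one random adjacent pair; keep the old order if the swap
    would violate a dependency."""
    i = rng.randrange(len(schedule) - 1)
    cand = list(schedule)
    cand[i], cand[i + 1] = cand[i + 1], cand[i]
    return cand if is_valid(cand) else list(schedule)


def stochastic_search(initial, steps=2000, temperature=1.0, seed=0):
    """Annealing-style search: always accept non-worsening moves and
    sometimes accept worsening ones; track the best schedule seen."""
    rng = random.Random(seed)
    current, best = list(initial), list(initial)
    for step in range(steps):
        t = temperature * (1 - step / steps) + 1e-9  # linear cooling
        cand = neighbor(current, rng)
        delta = cost(cand) - cost(current)
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current = cand
        if cost(current) < cost(best):
            best = list(current)
    return best
```

Calling `stochastic_search([name for name, _, _ in INSTRS])` starts from the listed order, which costs 12 cycles under this toy model, and returns a dependency-respecting reordering that is never worse. The gain in this example comes from issuing the independent `load_c` earlier so that its latency overlaps with the other loads.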