Large language models (LLMs) have demonstrated strong code generation capabilities, yet the runtime performance of generated code is not guaranteed, and there have been few attempts to train LLMs using runtime performance as a reward in the HPC domain. We propose an online reinforcement learning approach that executes LLM-generated code on a supercomputer and directly feeds back the measured runtime performance (GFLOPS) as a reward. We further introduce a Staged Quality-Diversity (SQD) algorithm that progressively varies the permitted optimization techniques on a per-problem basis, enabling the model to learn code optimization from diverse perspectives. We build a distributed system connecting a GPU training cluster with a CPU benchmarking cluster, and train Qwen2.5 Coder 14B on a double-precision matrix multiplication task using Group Relative Policy Optimization (GRPO). Through two experiments, we show that reinforcement learning combining runtime performance feedback with staged optimization can improve the HPC code generation capability of LLMs.
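As a rough illustration of the reward signal described above, the following sketch converts a measured runtime into GFLOPS for a square double-precision matrix multiplication and then normalizes the rewards within one sampled group, as GRPO does. It is not the system's implementation: the function names, the conventional 2n^3 flop count, the zero reward for failed candidates, and the use of the population standard deviation are all assumptions made for this example.

```python
from statistics import mean, pstdev


def dgemm_gflops(n: int, seconds: float) -> float:
    """GFLOPS for an n x n by n x n matrix multiply.

    Assumes the conventional count of 2 * n^3 floating-point operations.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * n**3 / seconds / 1e9


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). Using the
    population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical runtimes (seconds) for four candidates on a 2048 x 2048
    # problem; None marks a candidate that failed to compile or run.
    runtimes = [1.9, 0.6, None, 0.25]
    # Assumption: failed candidates receive a reward of zero.
    rewards = [dgemm_gflops(2048, t) if t is not None else 0.0 for t in runtimes]
    for reward, adv in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward = {reward:8.2f} GFLOPS, advantage = {adv:+.3f}")
```

Because the advantages are relative to the group, a candidate is reinforced only when it runs faster than the other samples drawn for the same problem, which is what lets a raw GFLOPS measurement serve directly as the reward.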