Fine-Tuning GPT-5 for GPU Kernel Generation

Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains difficult because of intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs) demonstrate strong capabilities in general sequential code generation, they face significant challenges in GPU code generation because of the scarcity of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations. These limitations preclude supervised fine-tuning (SFT) as a scalable method for improving current LLMs. In contrast, reinforcement learning (RL) offers a data-efficient and adaptive alternative, but it requires access to relevant tools, careful selection of training problems, and a robust evaluation environment. We present Makora's environment and tools for reinforcement learning fine-tuning of frontier models and report our results from fine-tuning GPT-5 for Triton code generation. In the single-attempt setting, our fine-tuned model improves kernel correctness from 43.7% to 77.0% (+33.3 percentage points) and increases the fraction of problems on which it outperforms TorchInductor from 14.8% to 21.8% (+7.0 percentage points) compared to baseline GPT-5, while exceeding prior state-of-the-art models on KernelBench. When integrated into a full coding agent, the model solves up to 97.4% of problems in an expanded KernelBench suite and outperforms the PyTorch TorchInductor compiler on 72.9% of problems, with a geometric mean speedup of 2.12x. Our work demonstrates that targeted post-training with reinforcement learning can unlock LLM capabilities in highly specialized technical domains where traditional supervised learning is limited by data availability, opening new pathways for AI-assisted accelerator programming.
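To make the three headline metrics concrete (correctness rate, fraction of problems faster than the baseline compiler, and geometric mean speedup), the following is a minimal Python sketch of how such numbers could be aggregated from per-problem results. It is an illustration only: the function name `summarize_results`, the record fields `correct` and `speedup`, and the choice to average speedups over correct kernels only are assumptions made here, not details taken from the evaluation described above.

```python
import math


def summarize_results(results):
    """Aggregate per-problem kernel results (hypothetical record format).

    Each record is a dict with:
      correct: bool, whether the generated kernel matched the reference output
      speedup: float or None, baseline_time / kernel_time (None if incorrect)
    """
    total = len(results)
    if total == 0:
        raise ValueError("no results to summarize")

    correct = [r for r in results if r["correct"]]
    # A problem counts as "outperforming" only if the kernel is correct AND faster.
    faster = [r for r in correct if r["speedup"] > 1.0]

    # Assumption: the geometric mean is taken over correct kernels only.
    if correct:
        log_sum = sum(math.log(r["speedup"]) for r in correct)
        geomean = math.exp(log_sum / len(correct))
    else:
        geomean = float("nan")

    return {
        "correct_rate": len(correct) / total,
        "faster_rate": len(faster) / total,
        "geomean_speedup": geomean,
    }


# Made-up example data, not results from the paper.
example = [
    {"correct": True, "speedup": 2.0},
    {"correct": True, "speedup": 0.5},
    {"correct": False, "speedup": None},
    {"correct": True, "speedup": 4.0},
]
summary = summarize_results(example)
```

The geometric mean is the usual choice for averaging speedup ratios because a 2x speedup and a 0.5x slowdown cancel out, which an arithmetic mean would not reflect.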
