High-quality kernels are critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the process is vulnerable to reward hacking and lazy optimization: models may exploit flaws in the training reward and prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design KernelGYM, a robust distributed GPU environment that supports reward-hacking checks, data collection from multi-turn interactions, and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy-gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO), which provides unbiased advantage estimation for multi-turn RL. To alleviate lazy optimization, we incorporate mismatch correction for training stability and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS). The trained model, Dr.Kernel-14B, reaches performance competitive with Claude-4.5-Sonnet on KernelBench. Finally, we study sequential test-time scaling for Dr.Kernel-14B. On the KernelBench Level-2 subset, 31.6% of the generated kernels achieve at least a 1.2x speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When selecting the best candidate across all turns, this 1.2x speedup rate further increases to 47.8%. All resources, including the environment, training code, models, and dataset, are available at https://www.github.com/hkust-nlp/KernelGYM.
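The self-inclusion bias mentioned above comes from using a group mean that contains a sample's own reward as its baseline; a leave-one-out baseline removes that dependence. As a rough illustration only, the sketch below shows the standard (turn-agnostic) leave-one-out advantage computation; the paper's turn-level TRLOO formulation is not reproduced here.

```python
# Sketch of a leave-one-out (RLOO-style) advantage estimate, for illustration.
# Each sample's baseline is the mean reward of the OTHER samples in its group,
# so the sample's own reward never leaks into its baseline (unlike a plain
# group-mean baseline, which includes it).
# This is a generic sketch, not the paper's turn-level TRLOO implementation.

def rloo_advantages(rewards):
    """Return leave-one-out advantages: r_i minus the mean of all other rewards."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two samples for a leave-one-out baseline")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# Example: a group of 4 sampled kernels with (hypothetical) speedup-based rewards.
advs = rloo_advantages([1.0, 0.0, 0.5, 0.5])
```

Note that the advantages of a group always sum to zero under this estimator, and each baseline is independent of the sample it scores, which is what makes the resulting policy-gradient estimate unbiased.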