The rapid evolution of Large Language Models (LLMs) has driven growing demand for the automated generation of high-performance system kernels that accelerate machine learning workloads. We introduce TritonRL, a domain-specialized 8B-parameter LLM for Triton programming, trained with a novel reinforcement learning (RL) framework. Triton kernel synthesis poses unique challenges, including data scarcity and a high susceptibility to reward hacking; our approach enables robust kernel generation through two primary innovations. First, we implement a multi-layered verification system that provides high-fidelity reward signals, ensuring that generated kernels are both syntactically and functionally valid. Second, we propose Hierarchical Reward Decomposition (HRD), which decouples the reinforcement of high-level reasoning from that of low-level implementation, resolving the credit assignment problem in long-sequence generation. Comprehensive evaluations on KernelBench demonstrate that TritonRL achieves state-of-the-art correctness and runtime speedup, outperforming concurrent Triton-specific models and matching frontier models with over 100B parameters. Our results highlight the effectiveness of hardware-aware RL for specialized domain adaptation.
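To make the two contributions concrete, the following is a minimal Python sketch of how a hierarchically decomposed reward might be wired to layered verification. All names, weights, and the clipping scheme here are illustrative assumptions for exposition, not the paper's actual implementation.

```python
def compiles(kernel_src: str) -> bool:
    """First verification layer (assumed): the kernel source must at
    least parse as valid Python/Triton before any reward is granted."""
    try:
        compile(kernel_src, "<kernel>", "exec")
        return True
    except SyntaxError:
        return False


def hierarchical_reward(reasoning_score: float,
                        kernel_src: str,
                        passes_tests: bool,
                        speedup: float,
                        w_reason: float = 0.3,
                        w_impl: float = 0.7) -> float:
    """Hypothetical decomposed reward: credit the high-level reasoning
    trace and the emitted kernel separately, gating the implementation
    reward on the verification layers.

    reasoning_score: reward for the reasoning trace (assumed given,
        e.g. by a separate judge); passes_tests: functional check
        against a reference implementation; speedup: measured runtime
        ratio vs. the baseline kernel.
    """
    # Low-level implementation reward is gated by verification:
    # no credit unless the kernel compiles AND matches the reference
    # output, which blocks syntactically-broken or wrong-but-fast
    # kernels from being reinforced.
    if not compiles(kernel_src) or not passes_tests:
        impl_reward = 0.0
    else:
        # Clip the speedup term to discourage reward hacking via
        # pathological timing exploits (clip value is an assumption).
        impl_reward = min(speedup, 2.0) / 2.0

    # Decoupled credit assignment: reasoning and implementation
    # contribute through separate, independently weighted terms.
    return w_reason * reasoning_score + w_impl * impl_reward


if __name__ == "__main__":
    # Toy usage: a trivially valid "kernel" that passes tests at 1.4x.
    r = hierarchical_reward(reasoning_score=0.8,
                            kernel_src="def f():\n    pass",
                            passes_tests=True,
                            speedup=1.4)
    print(f"decomposed reward: {r:.3f}")
```

The key design point the sketch illustrates is that the reasoning term can remain nonzero even when the kernel fails verification, so a long generation with sound planning but a buggy final kernel still receives partial, correctly attributed credit.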