GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but neither paradigm fundamentally improves the model's intrinsic CUDA optimization ability, yielding limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline; a skill-augmented CUDA development environment with automated verification and profiling that provides reliable reward signals; and reinforcement learning algorithmic techniques that enable stable training. CUDA Agent achieves state-of-the-art results on KernelBench, delivering faster rates of 100\%, 100\%, and 92\% over torch.compile on the Level-1, Level-2, and Level-3 splits, and outperforming the strongest proprietary models, such as Claude Opus 4.5 and Gemini 3 Pro, by about 40\% on the hardest Level-3 setting.