GPU kernel generation by LLMs has recently advanced rapidly, driven by test-time scaling and reinforcement learning techniques. However, a key challenge for kernel generation is the scarcity of high-quality data: most high-performance kernels are proprietary and not open-source, which prevents us from leveraging supervised fine-tuning to align LLMs to the kernel generation task. To address this challenge, we develop a pipeline that generates and curates high-quality CUDA kernels with reasoning traces, motivated by a critical observation that concise yet informative reasoning traces yield robust generation of high-performance kernels. Using this pipeline, we construct our dataset ConCuR and introduce our model KernelCoder, which is, to our knowledge, the first model trained on a curated dataset of (PyTorch program, reasoning trace, CUDA kernel) triples. In the KernelBench setup, our model achieves significant improvements over the existing top-performing model, QwQ-32B, and outperforms all open-source models fine-tuned for kernel generation, as well as frontier models such as DeepSeek-V3.1-Think and Claude-4-sonnet. Finally, we show that the average reasoning length can serve as a metric for assessing the difficulty of kernel generation tasks. Our observations, metrics, and data collection and curation pipeline can help obtain better data for the kernel generation task in the future.