High-performance CUDA kernels are essential for scalable AI systems, yet large language models (LLMs) still struggle to generate correct kernels because of strict and often implicit execution constraints. Existing LLM-based approaches either rely on costly agentic or reinforcement-learning (RL) pipelines, or adopt supervised fine-tuning (SFT) objectives that do not explicitly model CUDA sensitivity, that is, the code tokens or regions tightly coupled with execution constraints. In this work, we investigate CUDA sensitivity through token confidence patterns and show that it appears at both the token and the region level: most CUDA-sensitive tokens are predicted with high confidence, while a smaller low-confidence subset forms regions that correspond to execution-critical structures. These findings suggest that effective CUDA kernel generation should both leverage high-confidence CUDA-sensitive tokens and preserve low-confidence CUDA-sensitive regions. Building on these insights, we propose \textbf{\underline{CU}DA-\underline{Se}nsitive Instruction \underline{T}uning (CuSeT)}, a low-cost post-training method within a simple SFT framework. CuSeT follows the principle of ``from tokens to regions'' by combining \emph{adaptive token-level masking} with \emph{region-aware sample reweighting}. Experiments show that CuSeT consistently improves functional correctness across multiple model families and scales, outperforming standard SFT and advanced SFT variants, while remaining competitive with frontier CUDA kernel generation models at substantially lower inference cost.
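The abstract names the two ingredients of CuSeT but gives no formulas, so the following is only an illustrative sketch of how a token mask and a per-sample weight could be combined in an SFT loss. It is not the authors' implementation. Everything specific here is an assumption made for illustration: the per-sample mean confidence used as the adaptive threshold, the rule that keeps all CUDA-sensitive tokens and drops already-confident non-sensitive ones, the definition of a region as a run of at least two consecutive low-confidence sensitive tokens, and the hypothetical names `sample_loss`, `low_conf_region_fraction`, `boost`, and `min_len`.

```python
import math


def low_conf_region_fraction(probs, sensitive, tau, min_len=2):
    """Fraction of tokens that lie inside runs of at least `min_len`
    consecutive sensitive tokens with confidence below `tau`.
    (Hypothetical region definition, assumed for illustration.)"""
    in_region = 0
    run = 0
    for p, s in zip(probs, sensitive):
        if s and p < tau:
            run += 1
        else:
            if run >= min_len:
                in_region += run
            run = 0
    if run >= min_len:  # close a run that reaches the end of the sample
        in_region += run
    return in_region / len(probs)


def sample_loss(probs, sensitive, boost=1.0):
    """Masked, reweighted negative log-likelihood for one training sample.

    probs     -- model probability assigned to each reference token
    sensitive -- bool per token, True if the token is CUDA-sensitive
                 (how such labels are obtained is outside this sketch)
    """
    if not probs or len(probs) != len(sensitive):
        raise ValueError("probs and sensitive must be non-empty and aligned")

    # Assumed adaptive threshold: the sample's own mean token confidence.
    tau = sum(probs) / len(probs)

    # Assumed mask: always keep sensitive tokens; keep a non-sensitive
    # token only if the model is still unsure about it.
    keep = [s or p < tau for p, s in zip(probs, sensitive)]

    kept_nll = [-math.log(p) for p, k in zip(probs, keep) if k]
    nll = sum(kept_nll) / len(kept_nll) if kept_nll else 0.0

    # Assumed reweighting: samples with more low-confidence sensitive
    # regions receive a proportionally larger weight.
    weight = 1.0 + boost * low_conf_region_fraction(probs, sensitive, tau)
    return weight * nll, keep, weight


# Toy sample: tokens 2 and 3 form a low-confidence sensitive region.
probs = [0.9, 0.95, 0.2, 0.3, 0.99, 0.8]
sensitive = [True, False, True, True, False, False]
loss, keep, weight = sample_loss(probs, sensitive)
```

In this toy sample the confident non-sensitive tokens are masked out, and the two adjacent low-confidence sensitive tokens raise the sample weight above 1. In an actual training loop the same mask and weight would be applied to per-token cross-entropy values computed by the training framework.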