High-performance GPU kernels are critical to modern machine learning systems, yet developing them remains a manual, expert-driven process. Recent work has explored using LLMs to automate kernel generation, but the generated kernels still fall short of carefully tuned references on standardized benchmarks. We present CuTeGen, an agentic GPU kernel synthesis framework that treats kernel development as a structured generate-test-refine workflow over the CuTe abstraction layer. Two design choices distinguish CuTeGen from prior work. First, it targets CuTe rather than raw CUDA; CuTe exposes performance-critical structures such as tiling and data movement while remaining stable enough for iterative refinement. Second, it uses a delayed profiling schedule that withholds low-level performance feedback until the kernel's high-level structure has stabilized. On the 209 tasks of KernelBench Level-1 and Level-2, CuTeGen achieves an average speedup of 1.71$\times$ over PyTorch and outperforms the prior agentic baseline CudaForge (0.89$\times$) at comparable per-task generation cost. Code is available at https://github.com/taratt/cutegen.git.
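To make the generate-test-refine workflow with delayed profiling concrete, the following is a minimal illustrative sketch, not CuTeGen's actual implementation. All names (`Feedback`, `refine_loop`, `generate`, `test`, `profile`, `profile_after`) are hypothetical, and the fixed iteration threshold `profile_after` is an assumed stand-in for however the framework decides that the kernel's high-level structure has stabilized.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    """Feedback handed back to the generator after each attempt (hypothetical)."""
    correct: bool
    message: str                   # compiler / correctness-test output
    profile: Optional[str] = None  # low-level profiler output, withheld early on


def refine_loop(
    generate: Callable[[Optional[Feedback]], str],
    test: Callable[[str], Tuple[bool, str]],
    profile: Callable[[str], str],
    max_iters: int = 8,
    profile_after: int = 3,
) -> Tuple[Optional[str], List[Feedback]]:
    """Generate-test-refine loop with delayed profiling.

    Assumption: "structure has stabilized" is approximated by a fixed
    iteration threshold (profile_after); the real criterion may differ.
    """
    feedback: Optional[Feedback] = None
    last_correct: Optional[str] = None
    history: List[Feedback] = []

    for step in range(max_iters):
        kernel = generate(feedback)      # e.g. an LLM call producing kernel source
        correct, message = test(kernel)  # compile and check against a reference

        # Profiler output is attached only for correct kernels, and only
        # once the early, structure-finding iterations are over.
        prof = profile(kernel) if correct and step >= profile_after else None

        feedback = Feedback(correct, message, prof)
        history.append(feedback)
        if correct:
            last_correct = kernel

    return last_correct, history
```

In this sketch the generator sees only correctness feedback during the first `profile_after` iterations, and profiler output is added afterwards, which mirrors the idea of withholding low-level performance signals until the coarse design has settled.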