High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy usage, and hardware-specific optimizations. Recent work has explored using large language models (LLMs) to generate GPU kernels automatically, but generated implementations often struggle to maintain correctness and achieve competitive performance across iterative refinements. We present CuTeGen, an agentic framework for automated generation and optimization of GPU kernels that treats kernel development as a structured generate--test--refine workflow. Unlike approaches that rely on one-shot generation or large-scale search over candidate implementations, CuTeGen focuses on progressive refinement of a single evolving kernel through execution-based validation, structured debugging, and staged optimization. A key design choice is to generate kernels using the CuTe abstraction layer, which exposes performance-critical structures such as tiling and data movement while providing a more stable representation for iterative modification. To guide performance improvement, CuTeGen incorporates workload-aware optimization prompts and delayed integration of profiling feedback. Experimental results on matrix multiplication and activation workloads demonstrate that the framework produces functionally correct kernels and achieves competitive performance relative to optimized library implementations.
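To make the generate--test--refine workflow concrete, the following is a minimal Python sketch of how such a loop could be organized; it is not CuTeGen's actual implementation. The callables `generate`, `check`, and `profile`, the stage names, the `profile_from_stage` threshold, and the rule of keeping a candidate only if it is correct and faster are all illustrative assumptions. The abstract states only that a single kernel is refined through execution-based validation, debugging, staged optimization, and delayed profiling feedback.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class CheckResult:
    ok: bool
    error: Optional[str] = None         # compiler or mismatch message, if any
    runtime_ms: Optional[float] = None  # measured only when the kernel is correct


# Hypothetical interfaces (assumptions, not part of any real library):
#   generate(prompt, current_kernel, feedback) -> new kernel source
#   check(kernel) -> CheckResult from compiling, running, and comparing outputs
#   profile(kernel) -> textual profiler summary
Generate = Callable[[str, Optional[str], Optional[str]], str]
Checker = Callable[[str], CheckResult]
Profile = Callable[[str], str]


def debug_until_correct(
    kernel: str, generate: Generate, check: Checker, max_rounds: int
) -> Tuple[str, CheckResult]:
    """Re-prompt with the failure message until the kernel passes or rounds run out."""
    result = check(kernel)
    for _ in range(max_rounds):
        if result.ok:
            break
        kernel = generate("debug", kernel, result.error)
        result = check(kernel)
    return kernel, result


def refine_kernel(
    generate: Generate,
    check: Checker,
    profile: Profile,
    stages: Sequence[str] = ("tiling", "data_movement", "tuning"),
    profile_from_stage: int = 1,   # assumed: profiling feedback is withheld at first
    max_debug_rounds: int = 3,
) -> Optional[Tuple[str, float]]:
    """Refine one evolving kernel; return (source, runtime_ms) or None on failure."""
    kernel = generate("initial", None, None)
    kernel, result = debug_until_correct(kernel, generate, check, max_debug_rounds)
    if not result.ok:
        return None

    best, best_ms = kernel, result.runtime_ms
    for i, stage in enumerate(stages):
        # Delayed integration: early stages see no profiler output.
        feedback = profile(best) if i >= profile_from_stage else None
        candidate = generate(stage, best, feedback)
        candidate, result = debug_until_correct(
            candidate, generate, check, max_debug_rounds
        )
        # Assumed acceptance rule: keep the candidate only if correct and faster.
        if result.ok and result.runtime_ms < best_ms:
            best, best_ms = candidate, result.runtime_ms
    return best, best_ms
```

In this sketch a failed or slower candidate is discarded and the loop continues from the last known-good kernel, which is one simple way to keep a single kernel evolving without searching over many candidates.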