Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl's law impact, and iteratively refines Triton or CUDA C++ kernel implementations through hundreds of experiments without human intervention. A five-stage correctness harness covering smoke tests, shape sweeps, numerical stability, determinism verification, and edge-case coverage ensures every candidate kernel is validated before any speedup is recorded. The system comprises over 9,000 lines of Python, 18 starter kernel implementations across two backends, a six-tier optimization playbook, and integration with the KernelBench benchmark suite. AutoKernel covers nine kernel types spanning the dominant operations in modern transformer architectures. On an NVIDIA H100, our Triton kernels outperform both PyTorch eager and torch.compile (max-autotune) on the majority of tested configurations: 5.29x over eager on RMSNorm, 2.82x on softmax, and 2.21x on cross-entropy, while beating torch.compile by 2.83x, 3.44x, and 2.94x respectively. In community deployment, an AutoKernel-optimized kernel achieved first place on the vectorsum_v2 B200 leaderboard. The full system is available at https://github.com/RightNow-AI/autokernel.
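The ranking of bottlenecks by Amdahl's law impact can be illustrated with a minimal sketch. This is not AutoKernel's actual code: the function names, the per-kernel timings, and the assumed uniform 2x kernel speedup are all hypothetical and chosen only for illustration. Amdahl's law gives the end-to-end speedup as 1 / ((1 - p) + p / s), where p is the fraction of total runtime spent in a kernel and s is the speedup achieved on that kernel alone.

```python
def amdahl_speedup(fraction, kernel_speedup):
    """End-to-end speedup when a region taking `fraction` of total
    runtime is made `kernel_speedup` times faster (Amdahl's law)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def rank_bottlenecks(profile_ms, assumed_speedup=2.0):
    """Rank kernels by the end-to-end speedup each would yield if it
    alone were accelerated by `assumed_speedup` (a hypothetical value)."""
    total = sum(profile_ms.values())
    ranked = [
        (name, amdahl_speedup(t / total, assumed_speedup))
        for name, t in profile_ms.items()
    ]
    # Highest end-to-end payoff first.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical profile: milliseconds per kernel type in one forward pass.
profile_ms = {
    "matmul": 60.0,
    "softmax": 25.0,
    "rmsnorm": 10.0,
    "cross_entropy": 5.0,
}

ranking = rank_bottlenecks(profile_ms)
for name, speedup in ranking:
    print(f"{name:>14s}: {speedup:.3f}x end-to-end")
```

The sketch also shows why ranking matters: a large speedup on a kernel that occupies a small share of runtime yields only a modest end-to-end gain, so optimization effort is better spent first on the kernels with the largest runtime fraction.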