KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, Zewei Jiang, Dianshi Li, Uladzimir Pashkevich, Varna Puvvada, Feng Shi, Matt Steiner, Ruichao Xiao, Nathan Yan, Xiayu Yu, Zhou Fang, Roman Levenstein, Kunming Ho, Haishan Zhu, Alec Hammond, Richard Li, Ajit Mathews, Kaustubh Gondkar, Abdul Zainul-Abedin, Ketan Singh, Hongtao Yu, Wenyuan Chi, Barney Huang, Sean Zhang, Noah Weller, Zach Marine, Wyatt Cook, Carole-Jean Wu, Gaoxiang Liu

Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges: model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve, an agentic kernel coding framework, to tackle heterogeneity at scale for DLRM. KernelEvolve takes kernel specifications as input and automates kernel generation and optimization for recommendation models across heterogeneous hardware architectures. It does so by operating at multiple programming abstractions, from Triton and CuTe DSL to low-level hardware-agnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is formulated as a graph-based search with a selection policy, universal operator, fitness function, and termination rule, and it dynamically adapts to the runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta's AI accelerators. We validate KernelEvolve on the publicly available KernelBench suite, achieving a 100% pass rate on all 250 problems across three difficulty levels, and on 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness. KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and for heterogeneous AI systems at scale. Beyond performance efficiency improvements, KernelEvolve significantly lowers the programmability barrier for new AI hardware by enabling automated kernel generation for in-house developed AI hardware.
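
To make the four search components named in the abstract concrete, the following is a minimal toy sketch and not KernelEvolve's implementation. All names (`Node`, `graph_search`, `toy_fitness`, `toy_operator`, `greedy_select`, `budget_or_target`) are hypothetical. The assumptions are loud ones: a candidate "kernel" is reduced to a single integer tile size, the universal operator is a deterministic doubling/halving rule standing in for an LLM-driven transformation, and the fitness function is a made-up score that peaks at tile size 64 rather than a measured speedup gated on correctness.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    """One vertex of the search graph: a candidate and its score."""
    candidate: int                   # assumption: a tile size stands in for kernel source
    fitness: float                   # higher is better
    parent: Optional["Node"] = None  # edge back to the node this one was derived from


def graph_search(
    root_candidate: int,
    select: Callable[[List[Node]], Node],      # selection policy
    operator: Callable[[Node], List[int]],     # universal operator
    fitness: Callable[[int], float],           # fitness function
    should_stop: Callable[[int, Node], bool],  # termination rule
) -> Node:
    """Expand candidates until the termination rule fires; return the best node."""
    root = Node(root_candidate, fitness(root_candidate))
    frontier = [root]        # nodes not yet expanded
    seen = {root_candidate}  # avoid re-evaluating identical candidates
    best = root
    step = 0
    while frontier and not should_stop(step, best):
        parent = select(frontier)
        frontier.remove(parent)
        for cand in operator(parent):
            if cand in seen:
                continue
            seen.add(cand)
            child = Node(cand, fitness(cand), parent)
            frontier.append(child)
            if child.fitness > best.fitness:
                best = child
        step += 1
    return best


def toy_fitness(tile: int) -> float:
    # Made-up score: pretend tile size 64 is optimal.
    return float(-abs(tile - 64))


def toy_operator(node: Node) -> List[int]:
    # Stand-in for an LLM-generated rewrite: double or halve the tile size.
    return [node.candidate * 2, max(1, node.candidate // 2)]


def greedy_select(frontier: List[Node]) -> Node:
    # Always expand the highest-scoring unexpanded node.
    return max(frontier, key=lambda n: n.fitness)


def budget_or_target(step: int, best: Node) -> bool:
    # Stop after 20 expansions or once the toy optimum is reached.
    return step >= 20 or best.fitness == 0.0


best = graph_search(8, greedy_select, toy_operator, toy_fitness, budget_or_target)
print(best.candidate, best.fitness)
```

Starting from tile size 8, the greedy policy follows the lineage 8, 16, 32, 64 and stops when the toy target is reached. Swapping `greedy_select` for a different policy, or `toy_operator` for a model-backed generator, changes the search behavior without touching the loop, which is the separation of concerns the abstract's four-part description suggests.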
