KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBench-X, a benchmark designed to answer this question through category-aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic comparison of five representative methods yields three main findings. First, task structure determines correctness more than method design: category explains nearly three times as much variance in semantic correctness as method (9.4% vs. 3.3% explained deviance), and 72% of Fusion tasks fail under all five methods, whereas Math tasks are solved consistently. Second, iterative refinement improves correctness but not performance. Across GEAK iterations, compile rate rises from 52.3% to 68.8% while average speedup declines from $1.58\times$ to $1.44\times$; newly rescued kernels consistently underperform persistently correct ones ($1.16\times$ vs. $1.58\times$ speedup in round $0 \to 1$). Third, correctness does not imply efficiency: 46.6% of correct kernels are slower than the PyTorch eager baseline, and cross-hardware speedup variance reaches $21.4\times$. In addition, quantization remains completely unsolved (0/30 successes) despite non-trivial compilation rates, which points to a systematic misunderstanding of numerical computation contracts rather than surface-level syntax errors. These findings suggest that future progress depends on handling global coordination, explicitly modeling numerical precision, and incorporating hardware efficiency into generation. The code is available at https://github.com/BonnieW05/KernelBenchX
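
As a rough illustration of the efficiency figures quoted above, the following sketch computes a per-kernel speedup and the share of correct kernels that run slower than the baseline. It assumes speedup is defined as baseline time divided by kernel time; the function names and the timing values are hypothetical and are not taken from the benchmark's code or data.

```python
from statistics import mean


def speedup(baseline_ms: float, kernel_ms: float) -> float:
    """Speedup of a generated kernel over the baseline (>1 means faster).

    Assumed definition: baseline wall-clock time / kernel wall-clock time.
    """
    if baseline_ms <= 0 or kernel_ms <= 0:
        raise ValueError("timings must be positive")
    return baseline_ms / kernel_ms


def summarize(timings):
    """Summarize (baseline_ms, kernel_ms) pairs for correct kernels only."""
    speedups = [speedup(b, k) for b, k in timings]
    slower = sum(1 for s in speedups if s < 1.0)
    return {
        "mean_speedup": mean(speedups),
        "fraction_slower": slower / len(speedups),
    }


# Hypothetical timings in milliseconds, not data from the benchmark.
example = [(2.0, 1.0), (1.0, 2.0), (3.0, 3.0), (4.0, 8.0)]
stats = summarize(example)
print(stats)
```

Reporting both numbers matters because a few fast kernels can lift the mean speedup even when a large share of correct kernels is slower than the baseline.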
