KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBenchX, a benchmark designed to answer this question through category-aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic comparison of five representative methods yields three main findings. First, task structure determines correctness more than method design: category explains nearly three times as much variance in semantic correctness as method (9.4% vs 3.3% explained deviance), and 72% of Fusion tasks fail across all five methods while Math tasks are solved consistently. Second, iterative refinement improves correctness but not performance: across GEAK iterations, compile rate rises from 52.3% to 68.8% while average speedup declines from $1.58\times$ to $1.44\times$, and newly rescued kernels consistently underperform persistently correct ones ($1.16\times$ vs $1.58\times$ speedup from round 0 to round 1). Third, correctness does not imply efficiency: 46.6% of correct kernels are slower than the PyTorch eager baseline, and cross-hardware speedup variance reaches $21.4\times$. In addition, quantization remains completely unsolved (0/30 successes) despite non-trivial compilation rates, revealing a systematic misunderstanding of numerical computation contracts rather than surface-level syntax errors. These findings suggest that future progress depends on handling global coordination, explicitly modeling numerical precision, and incorporating hardware efficiency into generation. The code is available at https://github.com/BonnieW05/KernelBenchX.
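The aggregate metrics quoted above (compile rate, average speedup, the share of correct kernels slower than the eager baseline, and the rescued-versus-persistent comparison across refinement rounds) can be illustrated with a minimal sketch. It assumes a hypothetical per-kernel record, `KernelResult`, and hypothetical helpers `summarize` and `rescued_vs_persistent`; none of these names come from the KernelBenchX code, and the actual harness may define and aggregate these quantities differently.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class KernelResult:
    """Outcome of one generated kernel (hypothetical record layout)."""
    task: str
    category: str
    compiled: bool
    correct: bool
    baseline_ms: Optional[float] = None  # PyTorch eager latency, if measured
    kernel_ms: Optional[float] = None    # generated-kernel latency, if measured

    @property
    def speedup(self) -> Optional[float]:
        # Speedup is only defined for correct kernels with both timings present.
        if not self.correct or not self.baseline_ms or not self.kernel_ms:
            return None
        return self.baseline_ms / self.kernel_ms


def summarize(results: List[KernelResult]) -> dict:
    """Aggregate compile rate, correctness rate, and speedup statistics."""
    n = len(results)
    speedups = [r.speedup for r in results if r.speedup is not None]
    return {
        "compile_rate": sum(r.compiled for r in results) / n,
        "correct_rate": sum(r.correct for r in results) / n,
        "mean_speedup": mean(speedups) if speedups else None,
        # Share of correct kernels that run slower than the eager baseline.
        "slower_than_baseline": (
            sum(s < 1.0 for s in speedups) / len(speedups) if speedups else None
        ),
    }


def rescued_vs_persistent(
    prev: List[KernelResult], curr: List[KernelResult]
) -> Tuple[Optional[float], Optional[float]]:
    """Mean speedup of newly rescued vs persistently correct kernels."""
    prev_correct = {r.task for r in prev if r.correct}
    rescued = [r.speedup for r in curr
               if r.speedup is not None and r.task not in prev_correct]
    persistent = [r.speedup for r in curr
                  if r.speedup is not None and r.task in prev_correct]
    return (mean(rescued) if rescued else None,
            mean(persistent) if persistent else None)


# Toy data, made up for illustration only (not benchmark results).
round0 = [
    KernelResult("add", "Math", True, True, 2.0, 1.0),
    KernelResult("gelu", "Math", True, True, 1.0, 2.0),
    KernelResult("fused_ln", "Fusion", True, False),
    KernelResult("int8_gemm", "Quantization", False, False),
]
round1 = [
    KernelResult("add", "Math", True, True, 2.0, 1.0),
    KernelResult("gelu", "Math", True, True, 1.0, 2.0),
    KernelResult("fused_ln", "Fusion", True, True, 3.0, 3.0),
    KernelResult("int8_gemm", "Quantization", True, False),
]
```

Calling `summarize(round0)` and `summarize(round1)` gives per-round aggregates, and `rescued_vs_persistent(round0, round1)` separates kernels that became correct only after refinement from those that were already correct. That split is the kind of comparison behind the second finding.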
