CodegenBench: Can LLMs Write Efficient Code Across Architectures?

Large language models (LLMs) have been extensively evaluated on code generation for general-purpose programming and GPU-accelerated environments (e.g., PyTorch, CUDA), but their capabilities in CPU-oriented high-performance computing (HPC) across diverse architectures remain underexplored. To bridge this gap, we introduce CodegenBench, a benchmark suite that evaluates the generation of efficient parallel code on three distinct hardware platforms: x86_64, Sunway, and Kunpeng. The benchmark comprises 106 standard Basic Linear Algebra Subprograms (BLAS) routines, which establish a common baseline, alongside 20 specialized computational kernels adapted for each of the supercomputing architectures (LeetSunway and LeetKunpeng). Our evaluation shows that state-of-the-art LLMs can generate optimized code for ubiquitous architectures such as x86_64, but exhibit significant performance degradation on domain-specific architectures with limited public documentation and training data, which highlights critical limitations in cross-platform generalization. Furthermore, our analysis of factors influencing code quality, such as implementation length and task complexity, indicates that current LLMs are most effective on moderately difficult problems that require concise code snippets. We open-source our dataset and automated evaluation infrastructure to facilitate future research in LLM-driven high-performance code generation. The resources are available at https://anonymous.4open.science/r/CodegenBench-EDE1/ and https://anonymous.4open.science/r/CodegenBenchDataset-2551.
