Recent advances in Code Large Language Models (CodeLLMs) have primarily focused on open-ended code generation, often overlooking the crucial aspect of code understanding and reasoning. To bridge this gap, we introduce CodeMMLU, a comprehensive multiple-choice benchmark designed to evaluate the depth of software and code comprehension in LLMs. CodeMMLU includes nearly 20,000 questions spanning diverse domains, including code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks that emphasize code generation, CodeMMLU assesses a model's ability to reason about programs across a wide range of tasks such as code repair, execution reasoning, and fill-in-the-blank challenges. Our extensive evaluation reveals that even state-of-the-art models struggle with CodeMMLU, highlighting significant gaps in comprehension beyond generation. By emphasizing the essential connection between code understanding and effective AI-assisted development, CodeMMLU provides a critical resource for advancing more reliable and capable coding assistants.