Recent advancements in Code Large Language Models (CodeLLMs) have predominantly focused on open-ended code generation tasks, often neglecting the critical aspect of code understanding and comprehension. To bridge this gap, we present CodeMMLU, a comprehensive multiple-choice question-answering benchmark designed to evaluate the depth of software and code understanding in LLMs. CodeMMLU includes over 10,000 questions sourced from diverse domains, encompassing tasks such as code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks, CodeMMLU assesses models' ability to reason about code rather than merely generate it, providing deeper insights into their grasp of complex software concepts and systems. Our extensive evaluation reveals that even state-of-the-art models face significant challenges with CodeMMLU, highlighting deficiencies in comprehension beyond code generation. By underscoring the crucial relationship between code understanding and effective generation, CodeMMLU serves as a vital resource for advancing AI-assisted software development, ultimately aiming to create more reliable and capable coding assistants.