Recent advancements in Code Large Language Models (CodeLLMs) have predominantly focused on open-ended code generation tasks, often neglecting the critical aspect of code understanding and comprehension. To bridge this gap, we present CodeMMLU, a comprehensive multiple-choice question-answering benchmark designed to evaluate the depth of software and code understanding in LLMs. CodeMMLU includes over 10,000 questions sourced from diverse domains, encompassing tasks such as code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks, CodeMMLU assesses models' ability to reason about code rather than merely generate it, providing deeper insights into their grasp of complex software concepts and systems. Our extensive evaluation reveals that even state-of-the-art models face significant challenges on CodeMMLU, highlighting deficiencies in comprehension that code generation benchmarks do not capture. By underscoring the crucial relationship between code understanding and effective generation, CodeMMLU serves as a vital resource for advancing AI-assisted software development, ultimately aiming to create more reliable and capable coding assistants.