CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation

Large Language Models (LLMs) have demonstrated remarkable performance on coding-related tasks, particularly in assisting human programmers and automating programming. However, existing benchmarks for evaluating the code understanding and generation capabilities of LLMs suffer from severe limitations. First, most benchmarks cover only a narrow range of popular programming languages and specific tasks, whereas real-world software development often requires systems built in multilingual programming environments to satisfy diverse requirements. Practical programming also calls for multi-task settings that test the coding capabilities of LLMs comprehensively and robustly. Second, most benchmarks fail to consider whether the generated code actually executes and whether its execution results are consistent. To bridge these gaps between existing benchmarks and the expectations of practical applications, we introduce CodeScope, an execution-based, multilingual, multi-task, multi-dimensional evaluation benchmark for comprehensively measuring LLM capabilities on coding tasks. CodeScope covers 43 programming languages and 8 coding tasks. It evaluates the coding performance of LLMs along three dimensions (perspectives): difficulty, efficiency, and length. To facilitate execution-based evaluation of code generation, we develop MultiCodeEngine, an automated code execution engine that supports 14 programming languages. Finally, we systematically evaluate and analyze 8 mainstream LLMs on CodeScope tasks and demonstrate that CodeScope offers greater breadth and poses greater challenges than other benchmarks for evaluating LLMs on code understanding and generation tasks. The CodeScope benchmark and datasets are publicly available at https://github.com/WeixiangYAN/CodeScope.
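To make the notion of execution-based evaluation concrete, the following is a minimal sketch of checking a generated program by running it against input/output test cases. It is not MultiCodeEngine and does not reflect its interface: the function name `passes_all_tests`, the test-case format, and the timeout value are assumptions chosen for illustration, and the sketch handles only Python candidates.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_all_tests(source, test_cases, timeout_s=5.0):
    """Return True only if `source` runs cleanly and prints the expected
    output for every (stdin_text, expected_stdout) pair in `test_cases`.

    Hypothetical helper for illustration; not part of CodeScope.
    """
    with tempfile.TemporaryDirectory() as tmp:
        # Write the candidate program to a scratch file.
        script = Path(tmp) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        for stdin_text, expected_stdout in test_cases:
            try:
                result = subprocess.run(
                    [sys.executable, str(script)],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=timeout_s,
                )
            except subprocess.TimeoutExpired:
                return False  # executability check: the program did not finish
            if result.returncode != 0:
                return False  # executability check: the program crashed
            if result.stdout.strip() != expected_stdout.strip():
                return False  # consistency check: output differs from expected
    return True
```

Under these assumptions, a candidate that reads two integers and prints their sum would pass the cases `[("1 2\n", "3"), ("10 -4\n", "6")]`, while one that prints their difference would fail, even though both are syntactically valid and look similar to a text-matching metric.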
