Large Language Models (LLMs) have demonstrated remarkable performance in assisting humans with programming and facilitating programming automation. However, existing benchmarks for evaluating the code understanding and generation capabilities of LLMs suffer from severe limitations. First, most benchmarks are insufficient in that they focus on a narrow range of popular programming languages and specific tasks, whereas real-world software development demands multilingual and multitask programming environments to satisfy diverse requirements. Second, most benchmarks fail to consider the actual executability of the generated code and the consistency of its execution results. To bridge these gaps between existing benchmarks and the expectations of practical applications, we introduce CodeScope, an execution-based, multilingual, multitask, multidimensional evaluation benchmark for comprehensively measuring the capabilities of LLMs on coding tasks. CodeScope covers 43 programming languages and eight coding tasks, and evaluates the coding performance of LLMs along three dimensions (perspectives): length, difficulty, and efficiency. To facilitate execution-based evaluation of code generation, we develop MultiCodeEngine, an automated code execution engine that supports 14 programming languages. Finally, we systematically evaluate and analyze eight mainstream LLMs, demonstrating that CodeScope offers greater breadth and poses stronger challenges than other benchmarks for evaluating LLMs on code understanding and generation tasks. The CodeScope benchmark and code are publicly available at https://github.com/WeixiangYAN/CodeScope.