Large Language Models (LLMs) have demonstrated remarkable performance on coding-related tasks, particularly in assisting human programmers and facilitating programming automation. However, existing benchmarks for evaluating the code understanding and generation capabilities of LLMs suffer from severe limitations. First, most benchmarks focus on a narrow range of popular programming languages and specific tasks, whereas real-world software development often requires systems built in multilingual programming environments to satisfy diverse requirements. Practical programming also calls for multi-task settings in order to test the coding capabilities of LLMs comprehensively and robustly. Second, most benchmarks fail to consider whether the generated code actually executes and whether its execution results are consistent. To bridge these gaps between existing benchmarks and the expectations of practical applications, we introduce CodeScope, an execution-based, multilingual, multi-task, multi-dimensional evaluation benchmark for comprehensively measuring LLM capabilities on coding tasks. CodeScope covers 43 programming languages and 8 coding tasks. It evaluates the coding performance of LLMs along three dimensions: difficulty, efficiency, and length. To facilitate execution-based evaluation of code generation, we develop MultiCodeEngine, an automated code execution engine that supports 14 programming languages. Finally, we systematically evaluate and analyze 8 mainstream LLMs on CodeScope tasks and demonstrate that CodeScope offers greater breadth and poses greater challenges than other benchmarks for evaluating LLMs on code understanding and generation tasks. The CodeScope benchmark and datasets are publicly available at https://github.com/WeixiangYAN/CodeScope.