Progress in automated programming depends on robust, comprehensive code generation benchmarks, yet current evaluation frameworks such as HumanEval and MBPP largely neglect object-oriented programming (OOP) in favor of functional programming (FP). To address this gap, we introduce a pioneering OOP-focused benchmark of 431 Python programs covering essential OOP concepts and features such as classes and encapsulation methods. We also propose pass@o, an evaluation metric tailored to OOP that extends the traditional pass@k measure. Our evaluation of 23 leading large language models (LLMs), including both general-purpose and code-specialized models, yields three key findings: 1) pass@o offers a more relevant and comprehensive assessment of OOP code generation; 2) despite excelling at FP, code-specialized LLMs such as WizardCoder lag behind models such as ChatGPT on OOP; 3) the poor performance of all advanced LLMs on our OOP benchmark highlights a critical need for improvement in this area. Our benchmark and scripts are publicly available at https://github.com/alphadl/OOP-eval.
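For readers unfamiliar with the pass@k measure that pass@o builds on, the following minimal sketch may help. `pass_at_k` implements the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. The abstract does not define pass@o, so `oop_aware_pass_at_k` and `defines_required_names` are hypothetical illustrations of one way a metric could be made OOP-aware (a sample counts only if it passes the tests and also defines the expected class and method names). They are assumptions for illustration, not the benchmark's actual definition.

```python
from __future__ import annotations

import ast
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: samples that pass, k: budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def defines_required_names(source: str, required: set[str]) -> bool:
    """Hypothetical check: does the code define every required class/method name?"""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
    }
    return required <= defined


def oop_aware_pass_at_k(
    samples: list[tuple[str, bool]], required: set[str], k: int
) -> float:
    """Hypothetical OOP-aware variant (an assumption, not the paper's pass@o).

    Each sample is (source_code, passed_tests). A sample counts as correct
    only if it passed the tests AND defines all required names.
    """
    n = len(samples)
    c = sum(
        1
        for source, passed in samples
        if passed and defines_required_names(source, required)
    )
    return pass_at_k(n, c, k)
```

As a usage illustration, a sample that returns the right output from a bare function would count toward plain pass@k but not toward the OOP-aware variant when a class named `Dog` with a method `speak` is required. This is the kind of gap an OOP-specific metric is meant to expose.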