Advancing automated programming requires robust and comprehensive code generation benchmarks, yet current evaluation frameworks such as HumanEval and MBPP largely neglect object-oriented programming (OOP) in favor of functional programming (FP). To address this, our study introduces a pioneering OOP-focused benchmark of 431 Python programs that cover essential OOP concepts and features such as classes and encapsulation methods. We also propose a new evaluation metric, pass@o, tailored to OOP, which improves on the traditional pass@k measure. Our evaluation of 23 leading large language models (LLMs), including both general and code-specialized models, reveals three key insights: 1) pass@o offers a more relevant and comprehensive assessment of OOP code generation; 2) despite excelling in FP, code-specialized LLMs like WizardCoder lag behind models like ChatGPT in OOP; 3) the poor performance of all advanced LLMs on our OOP benchmark highlights a critical need for improvement in this field. Our benchmark and scripts are publicly released at: https://github.com/alphadl/OOP-eval.
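To make the relationship between the two metrics concrete, the sketch below computes the standard unbiased pass@k estimate and then a stricter count of the kind an OOP-aware metric might use. The abstract does not define pass@o, so the second half is an assumption made for illustration only: `defines_required_names`, the `required` mapping, and the sample data are hypothetical and are not taken from the released scripts.

```python
import ast
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them correct, budget k."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def defines_required_names(source: str, required: dict[str, set[str]]) -> bool:
    """Hypothetical check: does `source` define each required class and its methods?"""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    found: dict[str, set[str]] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            methods = {
                item.name
                for item in node.body
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
            }
            found.setdefault(node.name, set()).update(methods)
    return all(
        cls in found and methods <= found[cls] for cls, methods in required.items()
    )


# Illustrative data only: (generated source, did it pass the unit tests?)
samples = [
    ("class Account:\n    def deposit(self, x):\n        return x\n", True),
    ("def deposit(x):\n    return x\n", True),   # passes tests, but no class
    ("class Account:\n    pass\n", False),
]
required = {"Account": {"deposit"}}  # assumed OOP requirements for the task

n = len(samples)
c_tests = sum(passed for _, passed in samples)
# Assumed stricter criterion: tests pass AND the required OOP structure exists.
c_strict = sum(
    passed and defines_required_names(src, required) for src, passed in samples
)

print(pass_at_k(n, c_tests, 1), pass_at_k(n, c_strict, 1))
```

Under this assumed criterion, a sample that passes the tests with a plain function still counts toward pass@k but not toward the stricter score, which is one way a test-only metric can overstate OOP ability.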