Existing evaluation benchmarks for language models of code (code LMs) focus almost exclusively on whether the LMs can generate functionally correct code. In real-world software engineering, developers think beyond functional correctness. They have requirements on "how" a functionality should be implemented to meet overall system design objectives such as efficiency, security, and maintainability. They would also trust code LMs more if the LMs demonstrated a robust understanding of requirements and code semantics. We propose a new benchmark, NoFunEval, to evaluate code LMs on non-functional requirements and on simple classification instances for both functional and non-functional requirements. We also propose a prompting method, Coding Concepts (CoCo), as a way for a developer to communicate domain knowledge to the LMs. We conduct an extensive evaluation of twenty-two code LMs and find that they generally falter on our benchmark, hinting at fundamental blind spots in their training setups. Surprisingly, even the classification accuracy on functional-correctness instances derived from the popular HumanEval benchmark is low, calling into question the depth of their comprehension and the source of their success in generating functionally correct code in the first place. We will release our benchmark and evaluation scripts publicly at https://aka.ms/NoFunEval.