Existing evaluation benchmarks for language models of code (code LMs) focus almost exclusively on whether the LMs can generate functionally correct code. In real-world software engineering, developers think beyond functional correctness. They have requirements on "how" a functionality should be implemented to meet overall system design objectives like efficiency, security, and maintainability. They would also trust code LMs more if the LMs demonstrated a robust understanding of requirements and code semantics. We propose a new benchmark, NoFunEval, to evaluate code LMs on non-functional requirements and on simple classification instances for both functional and non-functional requirements. We also propose a prompting method, Coding Concepts (CoCo), as a way for a developer to communicate domain knowledge to the LMs. We conduct an extensive evaluation of twenty-two code LMs and find that they generally falter when tested on our benchmark, hinting at fundamental blindspots in their training setups. Surprisingly, even the classification accuracy on functional-correctness instances derived from the popular HumanEval benchmark is low, calling into question the depth of their comprehension and the source of their success in generating functionally correct code in the first place. We will release our benchmark and evaluation scripts publicly at https://aka.ms/NoFunEval.