Existing evaluation benchmarks for language models of code (code LMs) focus almost exclusively on whether the LMs can generate functionally correct code. In real-world software engineering, developers think beyond functional correctness: they have requirements on "how" a functionality should be implemented to meet overall system design objectives such as efficiency, security, and maintainability. They would also trust code LMs more if the LMs demonstrated a robust understanding of such requirements. We propose a new benchmark, NoFunEval, to evaluate code LMs on non-functional requirements and on simple classification instances for both functional and non-functional requirements. We also propose a prompting method, Coding Concepts (CoCo), as a way for a developer to communicate domain knowledge to the LMs. We conduct an extensive evaluation of 27 code LMs. Our finding is that they generally falter when tested on our benchmark, hinting at fundamental blindspots in their training setups. Surprisingly, even the classification accuracy on functional-correctness instances derived from the popular HumanEval benchmark is low, calling into question the depth of their comprehension and the source of their success in generating functionally correct code in the first place. We release our benchmark and evaluation scripts publicly at https://aka.ms/NoFunEval.