Towards more realistic evaluation of LLM-based code generation: an experimental study and beyond

To evaluate the code generation capabilities of Large Language Models (LLMs) in complex real-world software development scenarios, many evaluation approaches have been developed. They typically leverage contextual code from the latest version of a project to help LLMs accurately generate the desired function. However, such approaches ignore the evolution of software projects over time, a situation we refer to as evolving-ignored. This leads to two problems, future context leakage and useful context missing, which in turn result in inaccurate evaluation of LLMs' performance. In this paper, we conduct an empirical study to understand LLMs' code generation performance in settings that reflect the evolving nature of software development. To this end, we first construct an evolving-aware repository-level code generation dataset, HumanEvo, equipped with an automated execution-based evaluation tool. Second, we manually categorize HumanEvo by dependency level to analyze more comprehensively how models perform when generating functions with different dependency levels. Third, we conduct extensive experiments on HumanEvo with seven representative and diverse LLMs to verify the effectiveness of the proposed benchmark. Our experimental study yields many important findings. For example, we find that previous evolving-ignored evaluation approaches inflate the measured performance of LLMs by 10.0% to 61.1%. Based on these findings, we give actionable suggestions for more realistic evaluation of LLMs on code generation. We also build a shared evolving-aware code generation toolbox to facilitate future research. The replication package, including source code, datasets, and appendix, is available at https://github.com/DeepSoftwareAnalytics/EvoEval.
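
The two problems named in the abstract, future context leakage and useful context missing, can be illustrated with a toy project history. The sketch below is only an illustration under stated assumptions: the `Commit` structure, the three-commit history, and the helper names (`evolving_ignored_context`, `evolving_aware_context`, `mentions`) are hypothetical and are not taken from HumanEvo or its toolbox. It assumes that the target function `parse` was introduced in the second commit and that "evolving-aware" means drawing context from the snapshot immediately before that commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """One step in a toy project history (hypothetical structure)."""
    sha: str
    files: dict  # path -> full file content after this commit


def evolving_ignored_context(history, target_index):
    """Context from the latest snapshot, regardless of when the target was written."""
    return history[-1].files


def evolving_aware_context(history, target_index):
    """Context from the snapshot just before the commit that adds the target."""
    if target_index == 0:
        return {}  # nothing existed before the first commit
    return history[target_index - 1].files


def mentions(context, needle):
    """True if any file in the context contains the given text."""
    return any(needle in content for content in context.values())


# Toy history: commit c1 introduces the target function `parse`.
history = [
    Commit("c0", {
        "utils.py": "def normalize(s):\n    return s.strip().lower()\n",
    }),
    Commit("c1", {  # target commit: adds parse(), which relies on normalize()
        "utils.py": (
            "def normalize(s):\n    return s.strip().lower()\n\n"
            "def parse(line):\n    return normalize(line).split(',')\n"
        ),
    }),
    Commit("c2", {  # later commit: normalize() renamed to clean(), new caller added
        "utils.py": (
            "def clean(s):\n    return s.strip().lower()\n\n"
            "def parse(line):\n    return clean(line).split(',')\n"
        ),
        "cli.py": "from utils import parse\n\nprint(parse(' A,B '))\n",
    }),
]

TARGET_INDEX = 1
ignored = evolving_ignored_context(history, TARGET_INDEX)
aware = evolving_aware_context(history, TARGET_INDEX)

# Future context leakage: the latest snapshot already contains the answer.
print("ignored context contains parse():", mentions(ignored, "def parse("))
print("aware context contains parse():  ", mentions(aware, "def parse("))

# Useful context missing: the helper available at the time has since been renamed.
print("ignored context contains normalize():", mentions(ignored, "def normalize("))
print("aware context contains normalize():  ", mentions(aware, "def normalize("))
```

In this toy setting, the latest snapshot both exposes the target function and its later caller and lacks the helper that existed when the function was written, while the pre-commit snapshot does neither. A real pipeline would work on actual repository checkouts rather than in-memory dictionaries.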
