Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world bugs from software projects have been developed. However, a growing concern within the software engineering community is that these benchmarks may not reliably reflect true LLM performance due to the risk of data leakage. Despite this concern, little research has been conducted to quantify the impact of potential leakage. In this paper, we systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks. To identify potential leakage, we use multiple metrics, including a study of benchmark membership within commonly used training datasets, as well as analyses of negative log-likelihood and n-gram accuracy. Our findings show that certain models, in particular codegen-multi, exhibit significant evidence of memorization on widely used benchmarks such as Defects4J, while newer models trained on larger datasets, such as LLaMa 3.1, exhibit limited signs of leakage. These results highlight the need for careful benchmark selection and the adoption of robust metrics to adequately assess model capabilities.
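For context, the following is a minimal sketch of how leakage signals of this kind might be computed, assuming a Hugging Face causal LM; the model name, prompt length, and the buggy_method.java sample file are illustrative choices, not the paper's exact setup.

```python
# Illustrative sketch (not the paper's exact procedure): estimate memorization
# signals for a code snippet via per-token negative log-likelihood (NLL) and
# greedy n-gram accuracy under a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-350M-multi"  # stand-in for an evaluated model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def avg_nll(text: str) -> float:
    """Average negative log-likelihood of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()  # cross-entropy loss = mean per-token NLL

def ngram_accuracy(text: str, n: int = 5, prompt_tokens: int = 32) -> float:
    """Fraction of greedy n-token continuations that exactly match the ground truth."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    hits, total = 0, 0
    for start in range(prompt_tokens, len(ids) - n, n):
        prefix = ids[:start].unsqueeze(0)
        with torch.no_grad():
            gen = model.generate(prefix, max_new_tokens=n, do_sample=False,
                                 pad_token_id=tok.eos_token_id)
        if torch.equal(gen[0, start:start + n], ids[start:start + n]):
            hits += 1
        total += 1
    return hits / max(total, 1)

# Hypothetical benchmark sample, e.g. a buggy method extracted from Defects4J.
snippet = open("buggy_method.java").read()
print(f"avg NLL: {avg_nll(snippet):.3f}, 5-gram acc: {ngram_accuracy(snippet):.2f}")
```

Low NLL and high n-gram accuracy on benchmark code, relative to comparable unseen code, would be consistent with memorization rather than generalization.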