Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world bugs from software projects have been developed. However, a growing concern within the software engineering community is that these benchmarks may not reliably reflect true LLM performance due to the risk of data leakage. Despite this concern, little research has been conducted to quantify the impact of potential leakage. In this paper, we systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks. To identify potential leakage, we use multiple metrics, including a study of benchmark membership within commonly used training datasets, as well as analyses of negative log-likelihood and n-gram accuracy. Our findings show that certain models, in particular codegen-multi, exhibit significant evidence of memorization on widely used benchmarks such as Defects4J, while newer models trained on larger datasets, such as LLaMa 3.1, exhibit limited signs of leakage. These results highlight the need for careful benchmark selection and the adoption of robust metrics to adequately assess model capabilities.
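For context, the following is a minimal sketch of how leakage signals of this kind might be computed, assuming a Hugging Face causal LM; the model name, prompt length, and the buggy_method.java sample file are illustrative choices, not the paper's exact setup.

```python
# Illustrative sketch (not the paper's exact procedure): estimate memorization
# signals for a code snippet via per-token negative log-likelihood (NLL) and
# greedy n-gram accuracy under a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-350M-multi"  # stand-in for an evaluated model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def avg_nll(text: str) -> float:
    """Average negative log-likelihood of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()  # cross-entropy loss = mean per-token NLL

def ngram_accuracy(text: str, n: int = 5, prompt_tokens: int = 32) -> float:
    """Fraction of greedy n-token continuations that exactly match the ground truth."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    hits, total = 0, 0
    for start in range(prompt_tokens, len(ids) - n, n):
        prefix = ids[:start].unsqueeze(0)
        with torch.no_grad():
            gen = model.generate(prefix, max_new_tokens=n, do_sample=False,
                                 pad_token_id=tok.eos_token_id)
        if torch.equal(gen[0, start:start + n], ids[start:start + n]):
            hits += 1
        total += 1
    return hits / max(total, 1)

# Hypothetical benchmark sample, e.g. a buggy method extracted from Defects4J.
snippet = open("buggy_method.java").read()
print(f"avg NLL: {avg_nll(snippet):.3f}, 5-gram acc: {ngram_accuracy(snippet):.2f}")
```

Low NLL and high n-gram accuracy on benchmark code, relative to comparable unseen code, would be consistent with memorization rather than generalization.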