Large Language Models (LLMs) have demonstrated strong natural language processing and code synthesis capabilities, leading to their rapid adoption in software engineering applications. However, details about LLM training data are often not made public, raising concern as to whether existing bug benchmarks are included in it. Because the training data of the popular GPT models is unavailable, we instead examine the training data of the open-source LLM StarCoder and find it likely that data from the widely used Defects4J benchmark was included, raising the possibility that it is part of the GPT training data as well. This makes it difficult to tell how well LLM-based results on Defects4J would generalize, since for any given result it is unclear whether a technique's performance stems from LLM generalization or from memorization. To remedy this issue and facilitate continued research on LLM-based SE, we present the GitHub Recent Bugs (GHRB) dataset, which includes 76 real-world Java bugs that were gathered after the OpenAI data cut-off point.