Large Language Models (LLMs) have demonstrated strong natural language processing and code synthesis capabilities, which has led to their rapid adoption in software engineering applications. However, details about LLM training data are often not made public, raising concern over whether existing bug benchmarks are included in it. Because the training data of the popular GPT models is unavailable, we instead examine the training data of the open-source LLM StarCoder and find it likely that data from the widely used Defects4J benchmark was included, which raises the possibility that it is part of the GPT training data as well. This makes it difficult to tell how well LLM-based results on Defects4J would generalize, because for any given result it is unclear whether a technique's performance is due to LLM generalization or memorization. To remedy this issue and facilitate continued research on LLM-based software engineering (SE), we present the GitHub Recent Bugs (GHRB) dataset, which includes 76 real-world Java bugs that were gathered after the OpenAI data cut-off point.