Motivation. Large language models (LLMs) have exhibited remarkable proficiency in diverse software engineering (SE) tasks. Handling such tasks typically involves acquiring foundational coding knowledge from large, general-purpose datasets during a pre-training phase, then refining the model on smaller, task-specific datasets during a fine-tuning phase.

Problem statement. Data leakage is a well-known issue in the training of machine learning models. One manifestation of this issue is an overlap between the training and testing splits. Intra-dataset code duplication, which examines this overlap within a given dataset, has been addressed in prior research. Inter-dataset code duplication, which gauges the overlap between different datasets, remains largely unexplored. If this phenomenon exists, it could compromise the integrity of LLM evaluations: fine-tuning test sets would include samples already encountered during pre-training, resulting in inflated performance metrics.

Contribution. This paper explores the phenomenon of inter-dataset code duplication and its impact on evaluating LLMs across diverse SE tasks.

Study design. We conduct an empirical study using the CSN dataset, a widely adopted pre-training dataset, and five fine-tuning datasets used for various SE tasks. We first identify the intersection between the pre-training and fine-tuning datasets using a deduplication process. Then, we fine-tune four models pre-trained on CSN and compare their performance on test samples encountered during pre-training with their performance on samples unseen during that phase.

Results. Our findings reveal a potential threat to the evaluation of various LLMs across multiple SE tasks, stemming from the inter-dataset code duplication phenomenon. Moreover, we demonstrate that this threat is accentuated by factors such as the LLM's size and the chosen fine-tuning technique.
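The deduplication step described in the study design can be illustrated with a small sketch. The abstract does not specify how duplicates are detected, so everything below is an assumption made for illustration: the token-set Jaccard similarity, the 0.8 threshold, the tokenizer regex, and the hypothetical helper names `tokenize`, `jaccard`, and `split_seen_unseen`. The sketch partitions a fine-tuning test set into samples that closely match something in the pre-training corpus ("seen") and samples that do not ("unseen").

```python
import re

# Assumed tokenizer: identifiers, integer literals, and single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Return the set of lexical tokens in a code snippet (whitespace ignored)."""
    return frozenset(TOKEN_RE.findall(code))


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def split_seen_unseen(pretrain_samples, test_samples, threshold=0.8):
    """Partition test samples by whether a near-duplicate exists in pre-training data.

    The threshold is an illustrative assumption, not a value from the paper.
    This is a quadratic pairwise scan, suitable only for small inputs.
    """
    pretrain_tokens = [tokenize(s) for s in pretrain_samples]
    seen, unseen = [], []
    for sample in test_samples:
        tokens = tokenize(sample)
        if any(jaccard(tokens, p) >= threshold for p in pretrain_tokens):
            seen.append(sample)
        else:
            unseen.append(sample)
    return seen, unseen
```

Evaluating a fine-tuned model separately on the two returned partitions is one way to expose the kind of performance gap the study investigates. At the scale of real pre-training corpora, the pairwise scan would have to be replaced by an approximate technique such as hashing-based candidate retrieval.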