On Inter-dataset Code Duplication and Data Leakage in Large Language Models

Motivation. Large language models (LLMs) have exhibited remarkable proficiency in diverse software engineering (SE) tasks. Handling such tasks typically involves acquiring foundational coding knowledge from large, general-purpose datasets during a pre-training phase, and subsequently refining that knowledge on smaller, task-specific datasets during a fine-tuning phase.

Problem statement. Data leakage is a well-known issue in the training of machine learning models. One manifestation of this issue is an overlap between the training and testing splits. Intra-dataset code duplication, which examines this overlap within a given dataset, has been addressed in prior research. Inter-dataset code duplication, which gauges the overlap between different datasets, remains largely unexplored. If this phenomenon exists, it could compromise the integrity of LLM evaluations: fine-tuning test samples that were already encountered during pre-training would result in inflated performance metrics.

Contribution. This paper explores the phenomenon of inter-dataset code duplication and its impact on evaluating LLMs across diverse SE tasks.

Study design. We conduct an empirical study using the CSN dataset, a widely adopted pre-training dataset, and five fine-tuning datasets used for various SE tasks. We first identify the intersection between the pre-training and fine-tuning datasets using a deduplication process. Then, we fine-tune four models pre-trained on CSN to evaluate their performance on samples encountered during pre-training and on those unseen during that phase.

Results. Our findings reveal a potential threat to the evaluation of various LLMs across multiple SE tasks, stemming from the inter-dataset code duplication phenomenon. Moreover, we demonstrate that this threat is accentuated by factors such as the LLM's size and the chosen fine-tuning technique.
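To make the study design more concrete, the following is a minimal sketch of how fine-tuning test samples might be split into those overlapping a pre-training corpus and those that do not. The abstract does not specify the deduplication process, so everything here is an assumption for illustration: the token-set Jaccard similarity, the 0.8 threshold, and the function names `tokenize`, `jaccard`, and `split_seen_unseen` are hypothetical and are not taken from the paper.

```python
import re
from typing import FrozenSet, List, Tuple

# Assumed tokenizer: identifiers, numbers, and single punctuation characters.
# Whitespace is ignored, so pure formatting changes do not affect the result.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code: str) -> FrozenSet[str]:
    """Return the set of lexical tokens found in a code snippet."""
    return frozenset(TOKEN_RE.findall(code))


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def split_seen_unseen(
    pretrain: List[str],
    finetune_test: List[str],
    threshold: float = 0.8,  # assumed cut-off, not a value from the paper
) -> Tuple[List[str], List[str]]:
    """Split test samples by whether a near-duplicate exists in `pretrain`.

    This is a quadratic pairwise comparison, which is fine for a toy
    example but would need indexing or hashing for corpus-scale data.
    """
    pretrain_tokens = [tokenize(sample) for sample in pretrain]
    seen, unseen = [], []
    for sample in finetune_test:
        tokens = tokenize(sample)
        if any(jaccard(tokens, p) >= threshold for p in pretrain_tokens):
            seen.append(sample)
        else:
            unseen.append(sample)
    return seen, unseen
```

Under these assumptions, a model's metrics could then be computed separately on the `seen` and `unseen` subsets, and a gap between the two would hint at inflation caused by inter-dataset duplication.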

