In recent years, code intelligence has gained increasing importance in automated software engineering. Meanwhile, the widespread adoption of Pretrained Language Models (PLMs) and Large Language Models (LLMs) has raised concerns about data contamination and its potential impact on model performance evaluation. Previous studies have mainly focused on sample-level contamination, overlooking the partial contamination scenarios that are pervasive in code intelligence. This paper fills this gap with a systematic empirical study of fine-grained data contamination on mainstream code tasks. Our study involves diverse representative PLMs (RoBERTa and GPT-2) and LLMs (LLaMA and StarCoder), covering three major tasks: code translation, code generation, and code summarization, across two Programming Languages (PLs): Java and Python. Following code intelligence practice, we categorize contamination scenarios into four types, namely input-only, output-only, unpaired, and paired contamination, and construct corresponding experimental and control groups for exploration. Experimental results show that, under the pre-training, fine-tuning, and inference paradigm adopted by PLMs, even deliberately injecting paired contamination does not lead to significant performance overestimation, whereas direct inference or small-scale fine-tuning does uncover the contamination effects. In contrast, LLMs, which follow the pre-training and inference paradigm, are significantly affected by paired contamination. None of the other contamination scenarios has a significant impact on either PLMs or LLMs. Our findings challenge the conventional belief that contamination inevitably leads to performance overestimation, offering new insights into the evaluation and deployment of code intelligence models.
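The four contamination settings can be sketched as a simple labeling rule over whether a test sample's input side, output side, or the aligned input-output pair appeared in the training data. This is a minimal illustration with hypothetical function and flag names, not an artifact from the study itself:

```python
def classify_contamination(input_seen: bool, output_seen: bool, pair_seen: bool) -> str:
    """Label a test sample's contamination scenario (hypothetical helper).

    input_seen  -- the sample's input (e.g. source code) appeared in training data
    output_seen -- the sample's output (e.g. target code or summary) appeared
    pair_seen   -- the aligned input-output pair appeared together
    """
    if pair_seen:
        return "paired"       # the full input-output pair leaked together
    if input_seen and output_seen:
        return "unpaired"     # both sides leaked, but never as an aligned pair
    if input_seen:
        return "input-only"   # only the input side leaked
    if output_seen:
        return "output-only"  # only the output side leaked
    return "clean"            # control group: no overlap with training data


# Example: both sides seen separately counts as "unpaired", not "paired"
print(classify_contamination(True, True, False))
```

An experimental group injects samples of one scenario into the training corpus, while the matching control group trains on the same corpus without them, so any evaluation gap can be attributed to that contamination type.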