While large language models have achieved remarkable performance on various code generation benchmarks, there are growing concerns about potential contamination of these benchmarks, as they may leak into pretraining and finetuning data. While recent work has investigated contamination in natural language generation and understanding tasks, there has been less extensive research into how data contamination impacts the evaluation of code generation, which is critical for understanding the robustness and reliability of LLMs in programming contexts. In this work, we perform a comprehensive study of data contamination in popular code generation benchmarks, and precisely quantify their overlap with pretraining corpora through both surface-level and semantic-level matching. In our experiments, we show that there is substantial overlap between popular code generation benchmarks and open training corpora, and that models perform significantly better on the subset of benchmarks where similar solutions are seen during training. We also conduct extensive analysis of the factors that affect model memorization and generalization, such as model size, problem difficulty, and question length. We release all resulting files from our matching pipeline for future research.
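For illustration, surface-level matching can be sketched as an n-gram overlap score between a benchmark solution and a training document; this is a minimal hedged example, and the paper's actual matching pipeline may use different tokenization, n-gram sizes, or scoring:

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def surface_overlap(candidate: str, document: str, n: int = 10) -> float:
    """Fraction of the candidate's n-grams that also occur in the document.

    A score near 1.0 suggests the candidate (e.g. a benchmark solution)
    is largely contained in the training document, i.e. a likely
    surface-level contamination hit.
    """
    cand = ngrams(candidate.split(), n)
    doc = ngrams(document.split(), n)
    if not cand:
        return 0.0
    return len(cand & doc) / len(cand)
```

Semantic-level matching would instead compare dense representations (e.g. cosine similarity of code embeddings) to catch paraphrased or lightly edited solutions that exact n-gram matching misses.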