In this paper we consider contamination by code generation test sets, in particular their use in modern large language models. We discuss three possible sources of such contamination and show findings supporting each of them: (i) direct data leakage, (ii) indirect data leakage through the use of synthetic data, and (iii) overfitting to evaluation sets during model selection. Key to our findings is a new dataset of 161 prompts with their associated Python solutions, which we release at https://huggingface.co/datasets/CohereForAI/lbpp .