In this paper, we consider contamination of code generation test sets, in particular as they are used to evaluate modern large language models. We discuss three possible sources of such contamination and present findings supporting each of them: (i) direct data leakage, (ii) indirect data leakage through the use of synthetic data, and (iii) overfitting to evaluation sets during model selection. To address this, we release Less Basic Python Problems (LBPP): a new, uncontaminated benchmark of 161 prompts with their associated Python solutions. LBPP is released at https://huggingface.co/datasets/CohereForAI/lbpp .