Concerned with Data Contamination? Assessing Countermeasures in Code Language Model

Various techniques have been proposed to leverage the capabilities of code language models (CLMs) for software engineering (SE) tasks. While these techniques are typically evaluated on publicly available datasets, the evaluation can be subject to data contamination threats, where the evaluation datasets have already been used to train the CLMs concerned. This can significantly affect the reliability of the evaluation. Several countermeasures have been suggested to mitigate the data contamination threat, including using more recent data, curating new data, and refactoring existing data, yet it is unclear whether these countermeasures really mitigate data contamination threats to model evaluation. To fill this gap, we conduct a systematic study to quantify the impact of these countermeasures on CLMs' performance. To facilitate the study, we collected over 2 million Python functions with timestamps ranging from January 1st, 2018, to December 31st, 2023. Data created before the models' cut-off date are considered "contaminated data", while data to which the countermeasures have been applied are regarded as "cleansed data". We study the impact of these countermeasures by comparing CLMs' performance on contaminated data with their performance on the cleansed data derived from each countermeasure. Our experiments yield several interesting observations. For instance, CLMs do not necessarily perform worse on data after the models' cut-off date; on the contrary, they sometimes perform better. In addition, refactoring does not always result in decreased performance; it can lead to improvements instead. Furthermore, existing metrics such as perplexity cannot distinguish contaminated from cleansed data. We hope that these results and observations will help deepen the understanding of CLMs' capabilities and inform the community about data contamination.
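
As a rough illustration of the setup described above, the following minimal sketch partitions timestamped functions around a model's cut-off date and computes perplexity from per-token log-probabilities. It covers only the "more recent data" countermeasure. The `Sample` schema, the function names, and the choice to treat data created on the cut-off date itself as cleansed are assumptions made for illustration, not details taken from the study.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class Sample:
    """A collected function and its creation date (hypothetical schema)."""
    source: str
    created: date


def split_by_cutoff(samples, cutoff):
    """Partition samples into (contaminated, cleansed) around a cut-off date.

    Assumption: data created strictly before the cut-off counts as
    contaminated; data created on or after it counts as cleansed.
    """
    contaminated = [s for s in samples if s.created < cutoff]
    cleansed = [s for s in samples if s.created >= cutoff]
    return contaminated, cleansed


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp(-mean(log p))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


if __name__ == "__main__":
    # Toy data; the dates and cut-off below are made up for the example.
    samples = [
        Sample("def add(a, b): return a + b", date(2019, 5, 1)),
        Sample("def sub(a, b): return a - b", date(2023, 6, 1)),
    ]
    old, new = split_by_cutoff(samples, cutoff=date(2021, 1, 1))
    print(len(old), len(new))
    # Four tokens, each with probability 0.5, as a stand-in for model output.
    print(perplexity([math.log(0.5)] * 4))
```

In practice the log-probabilities would come from the CLM under evaluation; the sketch takes them as plain numbers so that it stays independent of any particular model API.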
