Natural Language Processing (NLP) research is increasingly focusing on the use of Large Language Models (LLMs), with some of the most popular ones being either fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns about data contamination among researchers. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the problem of \emph{indirect} data leaking, where models are iteratively improved by using data coming from users. In this work, we conduct the first systematic analysis of work using OpenAI's GPT-3.5 and GPT-4, the most prominently used LLMs today, in the context of data contamination. By analysing 255 papers and considering OpenAI's data usage policy, we extensively document the amount of data leaked to these models during the first year after the models' release. We report that these models have been globally exposed to $\sim$4.7M samples from 263 benchmarks. At the same time, we document a number of evaluation malpractices emerging in the reviewed papers, such as unfair or missing baseline comparisons and reproducibility issues. We release our results as a collaborative project on https://leak-llm.github.io/, where other researchers can contribute to our efforts.