Natural Language Processing (NLP) research is increasingly focusing on the use of Large Language Models (LLMs), some of the most popular of which are fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns among researchers about data contamination. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the problem of \emph{indirect} data leakage, where models are iteratively improved using data coming from users. In this work, we conduct the first systematic analysis of work using OpenAI's GPT-3.5 and GPT-4, the most prominently used LLMs today, in the context of data contamination. By analysing 255 papers and considering OpenAI's data usage policy, we extensively document the amount of data leaked to these models during the first year after the models' release. We report that these models have been globally exposed to $\sim$4.7M samples from 263 benchmarks. At the same time, we document a number of evaluation malpractices emerging in the reviewed papers, such as unfair or missing baseline comparisons and reproducibility issues. We release our results as a collaborative project at https://leak-llm.github.io/, where other researchers can contribute to our efforts.