Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs

Natural Language Processing (NLP) research is increasingly focusing on the use of Large Language Models (LLMs), with some of the most popular ones being either fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns about data contamination among researchers. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the problem of \emph{indirect} data leaking, where models are iteratively improved by using data coming from users. In this work, we conduct the first systematic analysis of work using OpenAI's GPT-3.5 and GPT-4, the most prominently used LLMs today, in the context of data contamination. By analysing 255 papers and considering OpenAI's data usage policy, we extensively document the amount of data leaked to these models during the first year after the model's release. We report that these models have been globally exposed to $\sim$4.7M samples from 263 benchmarks. At the same time, we document a number of evaluation malpractices emerging in the reviewed papers, such as unfair or missing baseline comparisons and reproducibility issues. We release our results as a collaborative project on https://leak-llm.github.io/, where other researchers can contribute to our efforts.

翻译：自然语言处理（NLP）研究日益聚焦于大型语言模型（LLM）的应用，其中一些最受欢迎的模型完全或部分闭源。由于缺乏对模型细节（尤其是训练数据）的访问权限，研究人员多次对数据污染问题表示担忧。尽管已有若干尝试应对该问题，但均局限于轶事证据和试错法，且忽视了模型通过用户数据迭代改进所引发的间接数据泄露问题。本研究首次系统分析了当前最广泛使用的LLM——OpenAI的GPT-3.5与GPT-4——在数据污染背景下的表现。通过分析255篇论文并参考OpenAI的数据使用政策，我们全面记录了模型发布后首年内其泄露的数据量。结果表明，这些模型总计接触了来自263个基准测试的约470万个样本。与此同时，我们发现了论文中涌现的多种评估失范行为，例如不公平或缺失基线比较、结果可重复性问题。我们已通过协作项目平台（https://leak-llm.github.io/）公开研究结果，供其他研究人员共同完善相关工作。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

O’Reilly报告：知识图谱崛起——面向现代数据集成和数据结构体系，“The Rise of the Knowledge Graph——Toward Modern Data Integration and the Data Fabric Architecture”

专知会员服务

49+阅读 · 2022年2月18日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日