Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs

Natural Language Processing (NLP) research is increasingly focusing on the use of Large Language Models (LLMs), with some of the most popular ones being either fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns about data contamination among researchers. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the problem of \emph{indirect} data leaking, where models are iteratively improved by using data coming from users. In this work, we conduct the first systematic analysis of work using OpenAI's GPT-3.5 and GPT-4, the most prominently used LLMs today, in the context of data contamination. By analysing 255 papers and considering OpenAI's data usage policy, we extensively document the amount of data leaked to these models during the first year after the model's release. We report that these models have been globally exposed to $\sim$4.7M samples from 263 benchmarks. At the same time, we document a number of evaluation malpractices emerging in the reviewed papers, such as unfair or missing baseline comparisons and reproducibility issues. We release our results as a collaborative project on https://leak-llm.github.io/, where other researchers can contribute to our efforts.

翻译：自然语言处理（NLP）研究日益聚焦于大语言模型（LLM）的使用，其中一些最流行的模型为完全或部分闭源。由于无法获取模型细节（尤其是训练数据），研究人员反复对数据污染问题表示担忧。尽管已有多次尝试应对该问题，但均局限于零散证据和试错法，且忽略了"间接"数据泄露问题——即模型通过使用用户反馈数据被迭代优化。本研究首次系统分析了当前最广泛使用的LLM——OpenAI的GPT-3.5和GPT-4在数据污染背景下的表现。通过分析255篇论文并考虑OpenAI的数据使用政策，我们详尽记录了模型发布后第一年内泄露至这些模型的数据量。结果显示，这些模型已全局暴露于来自263个基准测试的约470万个样本中。与此同时，我们发现所审论文中存在诸多评估失范现象，例如不公平或缺失基线对比、可复现性问题等。我们将研究结果以协作项目形式发布于https://leak-llm.github.io/，欢迎其他研究者共同参与。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

O’Reilly报告：知识图谱崛起——面向现代数据集成和数据结构体系，“The Rise of the Knowledge Graph——Toward Modern Data Integration and the Data Fabric Architecture”

专知会员服务

49+阅读 · 2022年2月18日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日