Investigating Data Contamination in Modern Benchmarks for Large Language Models

Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. This issue is especially critical for closed-source models and certain open-source models where training data transparency is lacking. In this paper we study data contamination by proposing two methods tailored for both open-source and proprietary LLMs. We first introduce a retrieval-based system to explore potential overlaps between evaluation benchmarks and pretraining corpora. We further present a novel investigation protocol named \textbf{T}estset \textbf{S}lot Guessing (\textit{TS-Guessing}), applicable to both open and proprietary models. This approach entails masking a wrong answer in a multiple-choice question and prompting the model to fill in the gap. Additionally, it involves obscuring an unlikely word in an evaluation example and asking the model to produce it. We find that certain commercial LLMs could surprisingly guess the missing option in various test sets. Specifically, in the TruthfulQA benchmark, we find that LLMs exhibit notable performance improvement when provided with additional metadata in the benchmark. Further, in the MMLU benchmark, ChatGPT and GPT-4 demonstrated an exact match rate of 52\% and 57\%, respectively, in guessing the missing options in benchmark test data. We hope these results underscore the need for more robust evaluation methodologies and benchmarks in the field.

翻译：近期观察结果凸显了大语言模型虚高的基准测试分数与其实际性能之间的差距，引发了对其评估基准可能存在污染的担忧。对于训练数据透明度不足的闭源模型及部分开源模型而言，这一问题尤为严峻。本文通过提出两种分别适用于开源和专有大型语言模型的方法来研究数据污染。我们首先引入了一种基于检索的系统，用于探索评估基准与预训练语料库之间的潜在重叠。进一步提出了一种名为\textbf{测试集槽位猜测}（\textit{TS-Guessing}）的新型检测协议，可同时适用于开源与专有模型。该方法通过掩盖多项选择题中的错误答案，诱导模型填补空缺；同时也会隐藏评估样本中某个不常见词汇，要求模型进行还原。我们发现某些商业LLM能够出人意料地猜出多个测试集中缺失的选项。具体而言，在TruthfulQA基准测试中，当提供该基准的额外元数据时，LLM展现出显著的性能提升。而在MMLU基准测试中，ChatGPT与GPT-4在猜测基准测试数据缺失选项时，精确匹配率分别达到52%和57%。我们期望这些结果能凸显该领域亟需更稳健的评估方法与基准测试体系。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

O’Reilly报告：知识图谱崛起——面向现代数据集成和数据结构体系，“The Rise of the Knowledge Graph——Toward Modern Data Integration and the Data Fabric Architecture”

专知会员服务

49+阅读 · 2022年2月18日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日