Recent observations have highlighted a gap between the inflated benchmark scores of LLMs and their actual performance, raising concerns that evaluation benchmarks may be contaminated. This issue is especially critical for closed-source models and for certain open-source models whose training data is not disclosed. In this paper, we study data contamination by proposing two methods that cover both open-source and proprietary LLMs. We first introduce a retrieval-based system to explore potential overlaps between evaluation benchmarks and pretraining corpora. We then present a novel investigation protocol named \textbf{T}estset \textbf{S}lot Guessing (\textit{TS-Guessing}), applicable to both open and proprietary models. The protocol masks a wrong answer in a multiple-choice question and prompts the model to fill in the gap; it also obscures an unlikely word in an evaluation example and asks the model to produce it. We find that certain commercial LLMs can, surprisingly, guess the missing option in various test sets. Specifically, on the TruthfulQA benchmark, LLMs show a notable performance improvement when provided with additional metadata from the benchmark. Further, on the MMLU benchmark, ChatGPT and GPT-4 achieve exact match rates of 52\% and 57\%, respectively, when guessing the missing options in benchmark test data. We hope these results underscore the need for more robust evaluation methodologies and benchmarks in the field.
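
The option-masking step of TS-Guessing can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the `[MASK]` placeholder, the item layout (`question`, `options`, `answer`), the whitespace- and case-insensitive matching, and the names `mask_wrong_option` and `exact_match_rate` are all assumptions made for illustration. The `model` argument is a hypothetical callable that maps a prompt string to the model's reply.

```python
import random
from typing import Callable, Dict, List, Tuple

MASK = "[MASK]"  # placeholder string; an assumption, not taken from the paper


def mask_wrong_option(question: str, options: Dict[str, str], answer_key: str,
                      rng: random.Random) -> Tuple[str, str]:
    """Hide one incorrect option and return (prompt, hidden option text)."""
    wrong_keys = sorted(k for k in options if k != answer_key)
    if not wrong_keys:
        raise ValueError("need at least one incorrect option to mask")
    hidden_key = rng.choice(wrong_keys)
    lines = [f"Question: {question}"]
    for key in sorted(options):
        shown = MASK if key == hidden_key else options[key]
        lines.append(f"{key}. {shown}")
    # Illustrative instruction; the wording is hypothetical.
    lines.append(f"Fill in the {MASK} in option {hidden_key}. "
                 "Reply with the option text only.")
    return "\n".join(lines), options[hidden_key]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing strings."""
    return " ".join(text.lower().split())


def exact_match_rate(items: List[dict], model: Callable[[str], str],
                     seed: int = 0) -> float:
    """Fraction of items where the model reproduces the hidden option."""
    rng = random.Random(seed)
    hits = 0
    for item in items:
        prompt, hidden = mask_wrong_option(
            item["question"], item["options"], item["answer"], rng)
        if normalize(model(prompt)) == normalize(hidden):
            hits += 1
    return hits / len(items) if items else 0.0
```

A high exact match rate under such a setup would be suspicious because an incorrect option is hard to infer from the question alone, which is the intuition behind masking a wrong answer rather than the correct one.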