Recent observations have underscored a disparity between inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. This issue is especially critical for closed-source models and for certain open-source models whose training data lacks transparency. In this paper, we study data contamination by proposing two methods suited to both open-source and proprietary LLMs. We first introduce a retrieval-based system to explore potential overlaps between evaluation benchmarks and pretraining corpora. We then present a novel investigation protocol named \textbf{T}estset \textbf{S}lot Guessing (\textit{TS-Guessing}), applicable to both open and proprietary models. This approach entails masking a wrong answer in a multiple-choice question and prompting the model to fill in the gap. It also involves obscuring an unlikely word in an evaluation example and asking the model to produce it. We find that certain commercial LLMs can, surprisingly, guess the missing option in various test sets. Specifically, on the TruthfulQA benchmark, we find that LLMs exhibit a notable performance improvement when provided with additional metadata from the benchmark. Further, on the MMLU benchmark, ChatGPT and GPT-4 achieve exact match rates of 52\% and 57\%, respectively, in guessing the missing options in benchmark test data. We hope these results underscore the need for more robust evaluation methodologies and benchmarks in the field.
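The option-masking step of TS-Guessing can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`mask_wrong_option`, `normalize`, `exact_match_rate`), the prompt wording, the `[MASK]` token, and the whitespace- and case-insensitive matching rule are all assumptions made for illustration, and the sample question is a toy example rather than an item from any benchmark.

```python
import random
import re


def mask_wrong_option(question, options, answer_key, rng):
    """Hide one incorrect option and build a fill-in prompt.

    `options` maps option letters to option text; `answer_key` is the
    letter of the correct answer. The prompt wording is hypothetical.
    Returns (prompt, masked_key, hidden_text).
    """
    wrong_keys = sorted(k for k in options if k != answer_key)
    masked_key = rng.choice(wrong_keys)
    lines = [f"Question: {question}", "Options:"]
    for key in sorted(options):
        text = "[MASK]" if key == masked_key else options[key]
        lines.append(f"{key}: {text}")
    lines.append(
        f"Fill in the [MASK] in option {masked_key}. "
        "Reply with the option text only."
    )
    return "\n".join(lines), masked_key, options[masked_key]


def normalize(text):
    """Lowercase and collapse whitespace (an assumed matching rule)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_match_rate(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    # Toy item, not taken from any benchmark.
    prompt, key, hidden = mask_wrong_option(
        "What is 2 + 2?",
        {"A": "3", "B": "4", "C": "5", "D": "22"},
        answer_key="B",
        rng=random.Random(0),
    )
    print(prompt)
```

In use, the prompt would be sent to the model under study, and its replies would be collected and scored against the hidden option texts with `exact_match_rate`. A high exact match rate on wrong options is notable because an incorrect distractor is hard to reconstruct from the question alone.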