The opacity of large language model (LLM) development raises growing concerns about potential contamination of public benchmarks in the pre-training data. Existing contamination detection methods are typically based on text overlap between training and evaluation data, which can be too superficial to reflect deeper forms of contamination. In this paper, we first present a cross-lingual form of contamination, deliberately injected by overfitting LLMs on translated versions of benchmark test sets, that inflates LLMs' performance while evading current detection methods. We then propose generalization-based approaches to unmask such deeply concealed contamination. Specifically, we examine how an LLM's performance changes after the original benchmark is modified by replacing the false answer choices with correct ones taken from other questions. Contaminated models can hardly generalize to such easier settings, where the false choices can be \emph{not even wrong}, since all choices are correct in their memorization. Experimental results demonstrate that cross-lingual contamination easily fools existing detection methods, but not ours. In addition, we discuss the potential use of cross-lingual contamination for interpreting LLMs' working mechanisms and for post-training LLMs toward enhanced multilingual capabilities. The code and dataset we use can be obtained from \url{https://github.com/ShangDataLab/Deep-Contam}.
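The choice-replacement probe described above can be sketched as follows. This is a minimal illustration, not the authors' released implementation: the question schema (a dict with `question`, `choices`, and an `answer` index) and the function name are assumptions for the example.

```python
import random

def rewrite_choices(questions, seed=0):
    """Hypothetical sketch: replace each question's false choices with
    correct answers drawn from *other* questions, so every choice is a
    true statement somewhere in the benchmark. A generalizing model
    should find the rewritten task easier; a model that memorized the
    original options may not."""
    rng = random.Random(seed)
    rewritten = []
    for i, q in enumerate(questions):
        # Pool of correct answers from all other questions.
        pool = [p["choices"][p["answer"]]
                for j, p in enumerate(questions) if j != i]
        new_choices = []
        for k, choice in enumerate(q["choices"]):
            if k == q["answer"]:
                new_choices.append(choice)          # keep the true answer
            else:
                new_choices.append(rng.choice(pool))  # swap in another correct answer
        rewritten.append({"question": q["question"],
                          "choices": new_choices,
                          "answer": q["answer"]})
    return rewritten
```

Evaluating a model on both the original and rewritten benchmark and comparing accuracies then gives the generalization signal: a clean model's accuracy should rise on the easier rewritten set, while a contaminated model's may stagnate or drop.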