Data contamination undermines the validity of Large Language Model evaluation by enabling models to rely on memorized benchmark content rather than true generalization. While prior work has proposed contamination detection methods, these approaches are largely limited to English benchmarks, leaving multilingual contamination poorly understood. In this work, we investigate contamination dynamics in multilingual settings by fine-tuning several open-weight LLMs on varying proportions of Arabic datasets and evaluating them on the original English benchmarks. To detect memorization, we extend the Tested Slot Guessing method with a choice-reordering strategy and incorporate Min-K% probability analysis, capturing both behavioral and distributional contamination signals. Our results show that translation into Arabic suppresses conventional contamination indicators, yet models still benefit from exposure to contaminated data, particularly those with stronger Arabic capabilities. This effect is consistently reflected in rising Min-K% scores and increased cross-lingual answer consistency as contamination levels grow. To address this blind spot, we propose Translation-Aware Contamination Detection, which identifies contamination by comparing signals across multiple translated benchmark variants rather than English alone. Translation-Aware Contamination Detection reliably exposes contamination even when English-only methods fail. Together, our findings highlight the need for multilingual, translation-aware evaluation pipelines to ensure fair, transparent, and reproducible assessment of LLMs.
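The Min-K% probability analysis mentioned above can be illustrated with a minimal sketch. The idea, as described in prior work on pretraining-data detection, is to score a sequence by the average log-probability of the k% of its tokens the model assigns the lowest probabilities to; memorized text tends to lack the surprising low-probability tokens that unseen text contains. The function and example values below are illustrative assumptions, not the paper's implementation.

```python
def min_k_percent_score(token_logprobs, k=0.2):
    """Min-K% probability score: the mean log-probability of the
    fraction k of tokens with the lowest probabilities. A higher
    (less negative) score is a distributional signal that the
    sequence may have been seen during training."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]  # the n least-likely tokens
    return sum(lowest) / n

# Hypothetical per-token log-probs for a benchmark item:
logprobs = [-0.1, -2.3, -0.5, -4.0, -0.2]
score = min_k_percent_score(logprobs, k=0.4)  # mean of the two lowest
```

In a translation-aware setting, one would compare such scores across the English benchmark and its translated variants rather than relying on the English scores alone.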