Large language models (LLMs) can be contaminated with benchmark data, inflating their scores and masking memorization as generalization; in multilingual settings, this memorization can even transfer to "uncontaminated" languages. Using the FLORES-200 translation benchmark as a diagnostic, we study two 7-8B-parameter instruction-tuned multilingual LLMs: Bloomz, which was trained on FLORES, and Llama as an uncontaminated control. We confirm Bloomz's FLORES contamination and show that machine translation contamination can be cross-directional: target-side memorization artificially boosts performance even in unseen translation directions. Further analysis shows that recall of memorized references often persists under source-side perturbations such as paraphrasing and named-entity replacement. Named-entity replacement, however, produces a consistent drop in BLEU, making it an effective probe for memorization in contaminated models.
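The named-entity-replacement probe described above can be sketched as follows. This is a minimal, self-contained illustration, not the paper's actual pipeline: the entity swap, the stub "models", and the unigram-precision score (a crude stand-in for BLEU) are all hypothetical. The idea it demonstrates is the abstract's: a contaminated model that recalls a memorized reference verbatim loses score when a source-side named entity is swapped, while a faithful translator does not.

```python
# Hedged sketch of probing memorization via named-entity replacement.
# All names below (the sentences, the swap, the stub models) are
# illustrative assumptions, not artifacts from the paper.

def unigram_precision(hyp: str, ref: str) -> float:
    """Crude stand-in for BLEU: clipped fraction of hypothesis tokens
    that also appear in the reference."""
    hyp_toks, ref_toks = hyp.split(), ref.split()
    if not hyp_toks:
        return 0.0
    ref_counts = {}
    for tok in ref_toks:
        ref_counts[tok] = ref_counts.get(tok, 0) + 1
    hits = 0
    for tok in hyp_toks:
        if ref_counts.get(tok, 0) > 0:
            ref_counts[tok] -= 1
            hits += 1
    return hits / len(hyp_toks)

# Hypothetical FLORES-style source/reference pair.
source = "Paris is the capital of France"
reference = "Paris est la capitale de la France"

# Perturb the source (and the gold reference) by swapping a named entity.
perturbed_source = source.replace("Paris", "Lyon")
perturbed_reference = reference.replace("Paris", "Lyon")

def memorizing_model(src: str) -> str:
    # Contaminated model: recalls the memorized reference verbatim,
    # ignoring the perturbed input.
    return reference

def faithful_model(src: str) -> str:
    # Uncontaminated model: stubbed to translate the perturbed input
    # correctly, entity included.
    return perturbed_reference

score_memorized = unigram_precision(memorizing_model(perturbed_source),
                                    perturbed_reference)
score_faithful = unigram_precision(faithful_model(perturbed_source),
                                   perturbed_reference)
# The memorized output mismatches the swapped entity, so its score drops
# below the faithful model's; that gap is the contamination signal.
```

In practice one would run this over a full test set with real BLEU (e.g. a corpus-level metric) rather than a single sentence, and interpret a consistent score drop after entity replacement as evidence of target-side memorization.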