In this position paper, we argue that the classical evaluation of Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble. The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark and then evaluated on the same benchmark. The extent of the problem is unknown, as it is not straightforward to measure. Contamination causes the performance of a contaminated model on a target benchmark and its associated task to be overestimated relative to non-contaminated counterparts. The consequences can be very harmful: wrong scientific conclusions are published while other correct ones are discarded. This position paper defines different levels of data contamination and argues for a community effort, including the development of automatic and semi-automatic measures to detect when data from a benchmark was exposed to a model, and suggestions for flagging papers whose conclusions are compromised by data contamination.