Automatic evaluation methods for large language models (LLMs) are hindered by data contamination, which inflates assessments of model effectiveness. Existing strategies, which aim to detect contaminated texts, focus on quantifying contamination status rather than accurately gauging model performance. In this paper, we introduce KIEval, a Knowledge-grounded Interactive Evaluation framework, which for the first time incorporates an LLM-powered "interactor" role to achieve dynamic, contamination-resilient evaluation. Starting with a question from a conventional LLM benchmark involving domain-specific knowledge, KIEval uses dynamically generated, multi-round, knowledge-focused dialogues to determine whether a model's response is merely a recall of benchmark answers or demonstrates the deep comprehension needed to apply knowledge in more complex conversations. Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization. We also reveal that data contamination contributes nothing to, and can even harm, models' real-world applicability and understanding, and that existing contamination detection methods for LLMs can identify contamination introduced during pre-training but not during supervised fine-tuning.
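The multi-round dialogue described in the abstract can be illustrated with a minimal sketch. It assumes the interactor and the evaluated model are each exposed as a plain function from the dialogue history to the next utterance; the names `interactive_dialogue`, `Agent`, and `Turn`, and the default of three rounds, are hypothetical and are not taken from KIEval itself. The abstract does not say how the resulting dialogue is scored, so the sketch stops at producing the transcript.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration only.
Turn = Tuple[str, str]               # (speaker, text)
Agent = Callable[[List[Turn]], str]  # dialogue history -> next utterance


def interactive_dialogue(
    seed_question: str,
    interactor: Agent,
    candidate: Agent,
    rounds: int = 3,
) -> List[Turn]:
    """Build a multi-round dialogue that starts from a benchmark question.

    The interactor opens with the seed question; after each candidate
    answer it produces a follow-up question based on the history so far.
    """
    history: List[Turn] = [("interactor", seed_question)]
    for i in range(rounds):
        # The evaluated model answers the most recent question.
        history.append(("candidate", candidate(history)))
        # The interactor asks a follow-up, except after the final answer.
        if i < rounds - 1:
            history.append(("interactor", interactor(history)))
    return history
```

In practice both agents would wrap LLM calls. Because the follow-up questions are generated at evaluation time rather than read from a fixed benchmark, a model that has only memorized the benchmark answer gains little beyond the first turn.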