As the capabilities and pre-training data of Large Language Models (LLMs) expand, evaluating them is becoming increasingly difficult. On one hand, data leakage causes over-estimation on existing benchmarks. On the other hand, periodically curating datasets by hand is costly. In this paper, we propose to automate dataset updates for reliable and timely evaluation. The basic idea is to generate unseen, high-quality test samples from existing ones to mitigate leakage. Specifically, we propose two strategies with systematic verification. First, the mimicking strategy employs LLMs to create new samples that resemble existing ones, preserving the style of the original dataset as far as possible. Our experiments demonstrate its evaluation stability across multiple instantiations and its effectiveness against data leakage in most cases. Second, for cases where mimicking works poorly, we design an extending strategy that adjusts the difficulty of the generated samples according to varying cognitive levels. This not only makes our evaluation more systematic but also, with balanced difficulty, discerns model capabilities better at a fine-grained level.
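The two strategies can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the function names `build_prompt` and `update_dataset`, the `generate` callable (a stand-in for any LLM call), and the level names in `COGNITIVE_LEVELS` are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows the general shape: prompt a model with an existing sample, optionally target a cognitive level, and keep outputs that are not copies of existing samples.

```python
from typing import Callable, Iterable, List, Optional

# Illustrative level names only; the abstract does not enumerate its cognitive levels.
COGNITIVE_LEVELS = ("recall", "application", "analysis")


def build_prompt(seed: str, level: Optional[str] = None) -> str:
    """Build a generation prompt from one existing benchmark sample."""
    if level is None:
        # Mimicking: ask for a new sample in the style of the seed.
        task = "Write one new question in the same style and format as the example."
    else:
        # Extending: keep the topic but target a stated cognitive level.
        task = f"Write one new question on the same topic that tests the '{level}' level."
    return f"{task}\n\nExample:\n{seed}\n\nNew question:"


def update_dataset(
    seeds: Iterable[str],
    generate: Callable[[str], str],
    levels: Iterable[Optional[str]] = (None,),
) -> List[str]:
    """Generate new samples, dropping empty outputs, copies of seeds, and duplicates."""
    seeds = list(seeds)
    levels = list(levels)
    # Normalized text of everything already known, so verbatim copies are rejected.
    seen = {s.strip().lower() for s in seeds}
    fresh: List[str] = []
    for seed in seeds:
        for level in levels:
            candidate = generate(build_prompt(seed, level)).strip()
            key = candidate.lower()
            if candidate and key not in seen:
                seen.add(key)
                fresh.append(candidate)
    return fresh
```

Here `generate` would wrap whichever model client is available. Passing `levels=(None,)` corresponds to mimicking only, while passing names from `COGNITIVE_LEVELS` requests difficulty-adjusted variants. The exact-match filter is a deliberately crude proxy for "unseen"; it does not stand in for the verification the paper describes.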