Large language models (LLMs) have achieved impressive performance across a wide range of natural language benchmarks, creating a continual need to curate more difficult datasets for larger LLMs, a process that is costly and time-consuming. In this paper, we propose to automate dataset updating and provide a systematic analysis of its effectiveness with respect to benchmark leakage, difficulty control, and stability. Thus, once the current benchmark has been mastered or leaked, we can update it for timely and reliable evaluation. We consider two updating strategies: 1) a mimicking strategy that generates similar samples based on the original data, preserving their stylistic and contextual essence, and 2) an extending strategy that further expands existing samples at varying cognitive levels by adapting Bloom's taxonomy of educational objectives. Extensive experiments on updated MMLU and BIG-Bench demonstrate the stability of the proposed strategies and show that the mimicking strategy can effectively alleviate the overestimation caused by benchmark leakage. In cases where the efficient mimicking strategy fails, our extending strategy still shows promising results. Additionally, by controlling the difficulty, we can better discern the models' performance and enable fine-grained analysis: an exam that is either too difficult or too easy cannot fairly judge students' learning status. To the best of our knowledge, we are the first to automate updating benchmarks for reliable and timely evaluation. Our demo leaderboard can be found at https://yingjiahao14.github.io/Automating-DatasetUpdates/.
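As a rough illustration of the two updating strategies named in the abstract, the sketch below builds the kind of prompt one might send to a generator LLM for each. It is not the authors' implementation: the `Sample` schema, the function names, the prompt wording, and the choice of the six revised Bloom's taxonomy levels are all assumptions made for illustration, and no model is actually called.

```python
from dataclasses import dataclass

# Cognitive levels of the revised Bloom's taxonomy. Which of these the
# extending strategy actually targets is an assumption of this sketch.
BLOOM_LEVELS = ("remember", "understand", "apply", "analyze", "evaluate", "create")


@dataclass(frozen=True)
class Sample:
    """One multiple-choice benchmark item (hypothetical schema)."""
    question: str
    choices: tuple[str, ...]
    answer: str  # letter of the correct choice, e.g. "B"


def format_sample(sample: Sample) -> str:
    """Render a sample as plain text with lettered choices."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [f"Question: {sample.question}"]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(sample.choices)]
    lines.append(f"Answer: {sample.answer}")
    return "\n".join(lines)


def build_mimic_prompt(seed: Sample) -> str:
    """Prompt asking for a new item in the same style and context as the seed."""
    return (
        "Write one new multiple-choice question that tests the same knowledge "
        "as the example, keeping its style and context but using different "
        "content.\n\n" + format_sample(seed)
    )


def build_extend_prompt(seed: Sample, level: str) -> str:
    """Prompt asking for a new item that targets a given cognitive level."""
    if level not in BLOOM_LEVELS:
        raise ValueError(f"unknown cognitive level: {level!r}")
    return (
        f"Starting from the example, write one new multiple-choice question "
        f"that requires the '{level}' cognitive level of Bloom's taxonomy.\n\n"
        + format_sample(seed)
    )


if __name__ == "__main__":
    seed = Sample("What is 2 + 2?", ("3", "4"), "B")
    print(build_mimic_prompt(seed))
    print()
    print(build_extend_prompt(seed, "analyze"))
```

Keeping prompt construction separate from the model call makes it straightforward to generate one mimicked item and several extended items per seed, one per cognitive level, which is one plausible way to vary difficulty as the abstract describes.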