Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models

Large language models (LLMs) have achieved impressive performance across various natural language benchmarks, prompting a continual need to curate more difficult datasets for larger LLMs, which is costly and time-consuming. In this paper, we propose to automate dataset updating and provide a systematic analysis of its effectiveness with respect to benchmark leakage, difficulty control, and stability. Thus, once the current benchmark has been mastered or leaked, we can update it for timely and reliable evaluation. We consider two updating strategies: 1) a mimicking strategy that generates similar samples based on the original data, preserving their stylistic and contextual essence, and 2) an extending strategy that further expands existing samples at varying cognitive levels by adapting Bloom's taxonomy of educational objectives. Extensive experiments on updated MMLU and BIG-Bench demonstrate the stability of the proposed strategies and show that the mimicking strategy can effectively alleviate the overestimation caused by benchmark leakage. In cases where the efficient mimicking strategy fails, our extending strategy still shows promising results. Additionally, by controlling the difficulty, we can better discern the models' performance and enable fine-grained analysis: an exam that is either too difficult or too easy cannot fairly judge students' learning status. To the best of our knowledge, we are the first to automate updating benchmarks for reliable and timely evaluation. Our demo leaderboard can be found at https://yingjiahao14.github.io/Automating-DatasetUpdates/.

