Have Seen Me Before? Automating Dataset Updates Towards Reliable and Timely Evaluation

Due to their expanding capabilities and pre-training data, Large Language Models (LLMs) face increasingly serious evaluation challenges. On the one hand, data leakage causes over-estimation on existing benchmarks. On the other hand, periodically curating datasets by hand is costly. In this paper, we propose to automate dataset updates for reliable and timely evaluation. The basic idea is to generate unseen, high-quality test samples from existing ones to mitigate leakage. Specifically, we propose two strategies with systematic verification. First, the mimicking strategy employs LLMs to create new samples resembling existing ones, preserving the style of the original dataset as far as possible. Our experiments demonstrate its evaluation stability across multiple instantiations and its effectiveness in dealing with data leakage in most cases. Second, for cases where mimicking works poorly, we design an extending strategy that adjusts the difficulty of the generated samples according to varying cognitive levels. This not only makes our evaluation more systematic but, with balanced difficulty, also discerns model capabilities better at fine-grained levels.
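The two strategies can be pictured as two kinds of generation prompts built from each existing sample. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the `Sample` type, the prompt wording, the function names, and the cognitive level names in `COGNITIVE_LEVELS` are all hypothetical, since the abstract says only that difficulty is adjusted "according to varying cognitive levels". The sketch builds prompt strings and leaves the actual LLM call and the verification step out.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical level names, chosen for illustration only; the abstract
# does not list the cognitive levels it uses.
COGNITIVE_LEVELS = ("remember", "understand", "apply", "analyze")


@dataclass(frozen=True)
class Sample:
    """An existing benchmark item used as the seed for new items."""
    question: str
    answer: str


def build_mimic_prompt(seed: Sample) -> str:
    """Mimicking: ask for a new item that keeps the seed's style."""
    return (
        "Write one new question-answer pair in the same style, topic area, "
        "and difficulty as the example, but with different content.\n"
        f"Example question: {seed.question}\n"
        f"Example answer: {seed.answer}\n"
    )


def build_extend_prompt(seed: Sample, level: str) -> str:
    """Extending: ask for a new item targeting a given cognitive level."""
    if level not in COGNITIVE_LEVELS:
        raise ValueError(f"unknown cognitive level: {level!r}")
    return (
        "Starting from the example, write one new question-answer pair "
        f"that tests the '{level}' cognitive level.\n"
        f"Example question: {seed.question}\n"
        f"Example answer: {seed.answer}\n"
    )


def build_update_prompts(seeds: Sequence[Sample]) -> List[Tuple[str, str]]:
    """Return (tag, prompt) pairs: one mimic prompt and one extend
    prompt per cognitive level for every seed sample."""
    prompts: List[Tuple[str, str]] = []
    for seed in seeds:
        prompts.append(("mimic", build_mimic_prompt(seed)))
        for level in COGNITIVE_LEVELS:
            prompts.append((f"extend:{level}", build_extend_prompt(seed, level)))
    return prompts
```

Each returned prompt would then be sent to a generator model, and the outputs filtered before they replace or supplement the original test set; both of those steps depend on details the abstract does not give.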

