Evaluation is critical for assessing capabilities, tracking scientific progress, and informing model selection. In this paper, we present three desiderata for a good benchmark for language models: (i) salience (e.g., knowledge about World War II is more salient than knowledge about a random day in history), (ii) novelty (i.e., the benchmark reveals new trends in model rankings not shown by previous benchmarks), and (iii) difficulty (i.e., the benchmark should be difficult for existing models, leaving headroom for future improvement). We operationalize these three desiderata and cast benchmark creation as a search problem: finding benchmarks that satisfy all three desiderata. To tackle this search problem, we present AutoBencher, which uses a language model to automatically search for datasets that meet the three desiderata. AutoBencher uses privileged information (e.g., relevant documents) to construct reliable datasets, and adaptivity with reranking to optimize for the search objective. We use AutoBencher to create datasets for math, multilingual, and knowledge-intensive question answering. The scalability of AutoBencher allows it to test fine-grained categories and tail knowledge, creating datasets that are on average 27% more novel and 22% more difficult than existing benchmarks. A closer investigation of our constructed datasets shows that we can identify specific gaps in LM knowledge that are not captured by existing benchmarks: for example, Gemini Pro performs much worse on question answering about the Permian Extinction and Fordism, while OpenAGI-7B performs surprisingly well on QA about COVID-19.
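Since the abstract does not spell out how the three desiderata are operationalized, the following Python sketch is purely illustrative: the novelty and difficulty proxies, the additive objective, and all function names are assumptions, shown only to convey the shape of the search problem (score candidate datasets on the desiderata, then keep the top-ranked ones).

```python
# Illustrative sketch of benchmark creation as a search problem.
# All scoring proxies below are assumptions, not the paper's actual method.

def novelty(candidate_accs: dict[str, float],
            prior_rankings: list[list[str]]) -> float:
    """Assumed proxy: how far the model ranking induced by this candidate
    dataset is from every prior benchmark's ranking (fraction of model
    pairs ordered differently). Assumes all rankings cover the same models."""
    ranking = sorted(candidate_accs, key=candidate_accs.get, reverse=True)

    def discordance(a: list[str], b: list[str]) -> float:
        pos = {m: i for i, m in enumerate(b)}
        pairs = [(a[i], a[j]) for i in range(len(a)) for j in range(i + 1, len(a))]
        return sum(pos[x] > pos[y] for x, y in pairs) / len(pairs)

    # Novelty = distance to the *nearest* prior ranking, so a candidate only
    # scores high if no existing benchmark already shows the same trend.
    return min(discordance(ranking, prior) for prior in prior_rankings)


def difficulty(candidate_accs: dict[str, float]) -> float:
    """Assumed proxy: headroom left below the best current model."""
    return 1.0 - max(candidate_accs.values())


def select_benchmarks(candidates: dict[str, dict[str, float]],
                      salience: dict[str, float],
                      prior_rankings: list[list[str]],
                      top_k: int = 3) -> list[str]:
    """Rank candidate topics by a combined desiderata score and keep the best."""
    def objective(topic: str) -> float:
        accs = candidates[topic]
        return salience[topic] + novelty(accs, prior_rankings) + difficulty(accs)

    return sorted(candidates, key=objective, reverse=True)[:top_k]


if __name__ == "__main__":
    # Toy example with made-up accuracies for three hypothetical models.
    candidates = {
        "Permian Extinction": {"model_a": 0.41, "model_b": 0.55, "model_c": 0.38},
        "World War II":       {"model_a": 0.90, "model_b": 0.88, "model_c": 0.86},
    }
    salience = {"Permian Extinction": 0.6, "World War II": 0.9}
    prior_rankings = [["model_a", "model_b", "model_c"]]
    print(select_benchmarks(candidates, salience, prior_rankings, top_k=1))
```

In this toy run the tail topic wins despite lower salience, because it both flips the prior model ranking (novelty) and leaves more headroom (difficulty), mirroring the trade-off the abstract describes.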