Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle to fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these approaches can introduce significant biases and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be immune to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions based on recently released math competitions, arXiv papers, news articles, and datasets, as well as harder, contamination-free versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B parameters. LiveBench is difficult, with top models achieving below 65% accuracy. We release all questions, code, and model answers. Questions will be added and updated monthly, and we will release new tasks and harder versions of existing tasks over time so that LiveBench can continue to distinguish between the capabilities of LLMs as they improve. We welcome community engagement and collaboration in expanding the benchmark's tasks and models.
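To make the scoring approach concrete, the following is a minimal sketch of automatic, ground-truth-based grading. The question format, the `score_answer` function, and the exact-match normalization rule are illustrative assumptions for exposition, not LiveBench's actual grading code, which varies per task.

```python
# Minimal sketch of objective, ground-truth-based scoring.
# The data layout and normalization rule below are illustrative
# assumptions; LiveBench's real scorers are task-specific.

def score_answer(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the ground truth, else 0.0."""
    normalize = lambda s: s.strip().lower()
    return 1.0 if normalize(model_output) == normalize(ground_truth) else 0.0

# Hypothetical questions with objective ground-truth values.
questions = [
    {"prompt": "What is 7 * 8?", "ground_truth": "56"},
    {"prompt": "What is the capital of France?", "ground_truth": "Paris"},
]
model_answers = ["56", "paris"]

# Aggregate per-question scores into a single accuracy number.
accuracy = sum(
    score_answer(ans, q["ground_truth"])
    for q, ans in zip(questions, model_answers)
) / len(questions)
print(f"Accuracy: {accuracy:.0%}")  # Accuracy: 100%
```

Because scoring compares against fixed ground-truth values rather than a judge's opinion, it is reproducible and free of judge bias; the trade-off is that tasks must be designed so that correct answers are objectively checkable.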