Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle to fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these approaches can introduce significant biases and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be immune to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions based on recently released math competitions, arXiv papers, news articles, and datasets, as well as harder, contamination-free versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B parameters. LiveBench is difficult, with top models achieving below 65% accuracy. We release all questions, code, and model answers. Questions will be added and updated monthly, and we will release new tasks and harder versions of existing tasks over time so that LiveBench can continue to distinguish between the capabilities of LLMs as they improve. We welcome community engagement and collaboration in expanding the benchmark's tasks and models.
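To make the scoring approach concrete, the following is a minimal sketch of automatic, ground-truth-based grading. The question format, the `score_answer` function, and the exact-match normalization rule are illustrative assumptions for exposition, not LiveBench's actual grading code, which varies per task.

```python
# Minimal sketch of objective, ground-truth-based scoring.
# The data layout and normalization rule below are illustrative
# assumptions; LiveBench's real scorers are task-specific.

def score_answer(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the ground truth, else 0.0."""
    normalize = lambda s: s.strip().lower()
    return 1.0 if normalize(model_output) == normalize(ground_truth) else 0.0

# Hypothetical questions with objective ground-truth values.
questions = [
    {"prompt": "What is 7 * 8?", "ground_truth": "56"},
    {"prompt": "What is the capital of France?", "ground_truth": "Paris"},
]
model_answers = ["56", "paris"]

# Aggregate per-question scores into a single accuracy number.
accuracy = sum(
    score_answer(ans, q["ground_truth"])
    for q, ans in zip(questions, model_answers)
) / len(questions)
print(f"Accuracy: {accuracy:.0%}")  # Accuracy: 100%
```

Because scoring compares against fixed ground-truth values rather than a judge's opinion, it is reproducible and free of judge bias; the trade-off is that tasks must be designed so that correct answers are objectively checkable.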