There have been many benchmarks for evaluating long-context language models (LCLMs), but developers often rely on synthetic tasks like needle-in-a-haystack (NIAH) or arbitrary subsets of tasks. It remains unclear whether these evaluations translate to the diverse downstream applications of LCLMs, and the inconsistency further complicates model comparison. We investigate the underlying reasons behind current practices and find that existing benchmarks often provide noisy signals due to low coverage of applications, insufficient lengths, unreliable metrics, and incompatibility with base models. In this work, we present HELMET (How to Evaluate Long-context Models Effectively and Thoroughly), a comprehensive benchmark encompassing seven diverse, application-centric categories. We also address many issues in previous benchmarks by adding controllable lengths up to 128k tokens, model-based evaluation for reliable metrics, and few-shot prompting for robustly evaluating base models. Consequently, we demonstrate that HELMET offers more reliable and consistent rankings of frontier LCLMs. Through a comprehensive study of 51 LCLMs, we find that (1) synthetic tasks like NIAH are not good predictors of downstream performance; (2) the diverse categories in HELMET exhibit distinct trends and low correlation with each other; and (3) while most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when the task requires full-context reasoning or following complex instructions -- the gap widens with increased lengths. Finally, we recommend using our RAG tasks for fast model development, as they are easy to run and more predictive of other downstream performance; ultimately, we advocate for a holistic evaluation across diverse tasks.