The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create RULER, a new synthetic benchmark with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories, multi-hop tracing and aggregation, to test behaviors beyond searching from context. We evaluate ten long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only four models (GPT-4, Command-R, Yi-34B, and Mixtral) can maintain satisfactory performance at the length of 32K. Our analysis of Yi-34B, which supports a context length of 200K, reveals substantial room for improvement as we increase input length and task complexity. We open-source RULER to spur comprehensive evaluation of long-context LMs.
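The vanilla NIAH setup described in the abstract can be sketched as follows. This is a minimal illustrative construction, not RULER's actual implementation: it pads a haystack with a repeated filler sentence (using whitespace word count as a rough token proxy), inserts the needle at a chosen relative depth, and appends a retrieval question. The function name and parameters are hypothetical.

```python
def make_niah_example(needle: str, filler: str,
                      target_words: int, depth: float) -> str:
    """Build a synthetic needle-in-a-haystack prompt (illustrative sketch).

    needle: the fact to retrieve, e.g. "The secret code is 7421."
    filler: a distractor sentence repeated to pad the haystack.
    target_words: rough haystack size in whitespace-separated words.
    depth: relative insertion position (0.0 = start, 1.0 = end).
    """
    # Pad with filler sentences until the haystack reaches the target size.
    sentences = []
    words = 0
    while words < target_words:
        sentences.append(filler)
        words += len(filler.split())

    # Insert the needle at the requested relative depth.
    idx = int(depth * len(sentences))
    sentences.insert(idx, needle)

    haystack = " ".join(sentences)
    question = "What is the secret code mentioned in the text above?"
    return f"{haystack}\n\n{question}"
```

Sweeping `target_words` and `depth` over a grid is what yields the familiar NIAH heatmap of retrieval accuracy by context length and needle position; RULER's variations additionally change the type and number of needles.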