The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark, RULER, with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories, multi-hop tracing and aggregation, to test behaviors beyond searching from context. We evaluate 17 long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy on the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at a length of 32K. Our analysis of Yi-34B, which supports a context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open-source RULER to spur comprehensive evaluation of long-context LMs.
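To make the vanilla NIAH setup concrete, here is a minimal sketch of how such a synthetic test instance might be constructed: a filler sentence is repeated to approximate a target context length, and a key-value "needle" is inserted at a random depth. The function name, filler text, and needle phrasing below are illustrative assumptions, not RULER's actual configuration.

```python
import random

def make_niah_example(context_len_words, needle_key, needle_value, filler_sentence):
    """Build a toy needle-in-a-haystack prompt.

    Repeats a distractor sentence to roughly reach the target word count,
    inserts the needle sentence at a random position, and appends the
    retrieval question. Returns (prompt, expected_answer).
    """
    filler_words = filler_sentence.split()
    n_repeats = max(1, context_len_words // len(filler_words))
    haystack = [filler_sentence] * n_repeats

    # The "needle": a single fact the model must later retrieve.
    needle = f"The special magic number for {needle_key} is {needle_value}."
    insert_at = random.randint(0, len(haystack))
    haystack.insert(insert_at, needle)

    question = f"What is the special magic number for {needle_key}?"
    prompt = " ".join(haystack) + "\n" + question
    return prompt, needle_value
```

Varying the number of needles, the needle types, or replacing retrieval with tracing/aggregation questions is, at a high level, how a benchmark like RULER extends this template.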