The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create RULER, a new synthetic benchmark with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories, multi-hop tracing and aggregation, to test behaviors beyond searching from context. We evaluate ten long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only four models (GPT-4, Command-R, Yi-34B, and Mixtral) can maintain satisfactory performance at the length of 32K. Our analysis of Yi-34B, which supports a context length of 200K, reveals substantial room for improvement as we increase input length and task complexity. We open-source RULER to spur comprehensive evaluation of long-context LMs.
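The vanilla NIAH setup described in the abstract can be sketched as follows. This is a minimal illustrative construction, not RULER's actual implementation: it pads a haystack with a repeated filler sentence (using whitespace word count as a rough token proxy), inserts the needle at a chosen relative depth, and appends a retrieval question. The function name and parameters are hypothetical.

```python
def make_niah_example(needle: str, filler: str,
                      target_words: int, depth: float) -> str:
    """Build a synthetic needle-in-a-haystack prompt (illustrative sketch).

    needle: the fact to retrieve, e.g. "The secret code is 7421."
    filler: a distractor sentence repeated to pad the haystack.
    target_words: rough haystack size in whitespace-separated words.
    depth: relative insertion position (0.0 = start, 1.0 = end).
    """
    # Pad with filler sentences until the haystack reaches the target size.
    sentences = []
    words = 0
    while words < target_words:
        sentences.append(filler)
        words += len(filler.split())

    # Insert the needle at the requested relative depth.
    idx = int(depth * len(sentences))
    sentences.insert(idx, needle)

    haystack = " ".join(sentences)
    question = "What is the secret code mentioned in the text above?"
    return f"{haystack}\n\n{question}"
```

Sweeping `target_words` and `depth` over a grid is what yields the familiar NIAH heatmap of retrieval accuracy by context length and needle position; RULER's variations additionally change the type and number of needles.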