The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark, RULER, with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories, multi-hop tracing and aggregation, to test behaviors beyond searching from context. We evaluate 17 long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy on the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at a length of 32K. Our analysis of Yi-34B, which supports a context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open-source RULER to spur comprehensive evaluation of long-context LMs.
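To make the vanilla NIAH setup concrete, here is a minimal sketch of how such a synthetic test instance might be constructed: a filler sentence is repeated to approximate a target context length, and a key-value "needle" is inserted at a random depth. The function name, filler text, and needle phrasing below are illustrative assumptions, not RULER's actual configuration.

```python
import random

def make_niah_example(context_len_words, needle_key, needle_value, filler_sentence):
    """Build a toy needle-in-a-haystack prompt.

    Repeats a distractor sentence to roughly reach the target word count,
    inserts the needle sentence at a random position, and appends the
    retrieval question. Returns (prompt, expected_answer).
    """
    filler_words = filler_sentence.split()
    n_repeats = max(1, context_len_words // len(filler_words))
    haystack = [filler_sentence] * n_repeats

    # The "needle": a single fact the model must later retrieve.
    needle = f"The special magic number for {needle_key} is {needle_value}."
    insert_at = random.randint(0, len(haystack))
    haystack.insert(insert_at, needle)

    question = f"What is the special magic number for {needle_key}?"
    prompt = " ".join(haystack) + "\n" + question
    return prompt, needle_value
```

Varying the number of needles, the needle types, or replacing retrieval with tracing/aggregation questions is, at a high level, how a benchmark like RULER extends this template.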