As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows. Our study also highlights the important point that token counts from different tokenizers should not be directly compared -- they often correspond to substantially different numbers of written characters. We release our code and long-context experimental data.