As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows. Our study also highlights the important point that token counts from different tokenizers should not be directly compared -- they often correspond to substantially different numbers of written characters. We release our code and long-context experimental data.