Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information. We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa.
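The key property described above — minimal lexical overlap between question and needle — can be illustrated with a small sketch. This is not the official NoLiMa code, and the example sentences below are hypothetical illustrations rather than items from the actual dataset; it only shows why a literal-match needle is easier to retrieve by surface cues than a NoLiMa-style needle requiring a latent association.

```python
def lexical_overlap(question: str, needle: str) -> float:
    """Jaccard overlap between the content-word sets of two strings."""
    stop = {"the", "a", "an", "which", "who", "in", "of", "to", "is", "was"}
    q = {w.strip("?.,").lower() for w in question.split()} - stop
    n = {w.strip("?.,").lower() for w in needle.split()} - stop
    return len(q & n) / len(q | n) if q | n else 0.0

question = "Which character has been to Helsinki?"

# Classic NIAH-style needle: shares the literal cue "Helsinki" with the question.
literal_needle = "Yuki has been to Helsinki."

# NoLiMa-style needle: answering requires the latent link
# "Senate Square" -> Helsinki, with no shared content words.
nolima_needle = "Actually, Yuki lives next to the Senate Square."

print(lexical_overlap(question, literal_needle))  # substantial overlap
print(lexical_overlap(question, nolima_needle))   # near zero
```

With literal overlap, a model can locate the needle by pattern matching alone; when overlap is near zero, it must infer the association, which is exactly what the abstract argues becomes harder for attention over long contexts.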