Recent advances in long-context language models (LCLMs), designed to handle extremely long contexts, have primarily focused on exploiting external contextual information, leaving the influence of the models' parametric knowledge underexplored. In this work, we first investigate how parametric knowledge affects content generation and show that its impact becomes increasingly pronounced as context length extends. We further show that the model's ability to utilize parametric knowledge, which we call parametric recall ability, does not improve in tandem with its ability to leverage contextual knowledge through extrinsic retrieval. Moreover, stronger extrinsic retrieval ability can interfere with parametric recall, limiting the model's full potential. To bridge this gap, we design a simple yet effective Hybrid Needle-in-a-Haystack test that evaluates models on both abilities jointly, rather than emphasizing extrinsic retrieval alone. Our experimental results reveal that Qwen-2.5 models significantly outperform Llama-3.1 models, demonstrating superior potential to combine the two abilities. Notably, even the more powerful Llama-3.1-70B-Instruct model fails to perform better, highlighting the importance of evaluating models from a dual-ability perspective.
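To make the dual-ability evaluation concrete, the following is a minimal sketch of how a "hybrid" Needle-in-a-Haystack prompt could be constructed: one contextual needle (answerable only from the supplied context, probing extrinsic retrieval) is buried in filler text and paired with a question whose answer is absent from the context (probing parametric recall). All sentences, the needle, and the questions here are illustrative placeholders, not the paper's actual test data or protocol.

```python
# Hypothetical sketch of a hybrid Needle-in-a-Haystack prompt builder.
# It is NOT the paper's implementation; it only illustrates the idea of
# probing extrinsic retrieval and parametric recall in a single prompt.

import random

FILLER = "The grass is green. The sky is blue. The sun is bright. "

def build_hybrid_prompt(context_len_chars: int, seed: int = 0) -> dict:
    rng = random.Random(seed)

    # Contextual needle: the answer exists only inside the haystack.
    contextual_needle = "The secret passcode for the vault is 7261."

    # Build filler of roughly the requested length, then bury the needle
    # at a random character offset.
    haystack = (FILLER * (context_len_chars // len(FILLER) + 1))[:context_len_chars]
    pos = rng.randrange(0, len(haystack))
    context = haystack[:pos] + " " + contextual_needle + " " + haystack[pos:]

    questions = {
        # Extrinsic retrieval: answerable only from the context above.
        "retrieval": "What is the secret passcode for the vault?",
        # Parametric recall: answerable only from knowledge stored in the
        # model's parameters, since the context never mentions it.
        "parametric": "What is the capital of France?",
    }
    prompt = context + "\n\n" + "\n".join(questions.values())
    return {"prompt": prompt, "needle_pos": pos, "questions": questions}

example = build_hybrid_prompt(2000)
```

A model is then scored on both questions per prompt, so a gain in retrieval accuracy that comes at the cost of parametric recall shows up directly, rather than being hidden by a retrieval-only metric.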