While recent language models can take long contexts as input, relatively little is known about how well they use longer contexts. We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval. We find that performance can degrade significantly when the position of relevant information changes, indicating that current language models do not robustly make use of information in long input contexts. In particular, we observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and degrades significantly when models must access relevant information in the middle of long contexts, even for explicitly long-context models. Our analysis provides a better understanding of how language models use their input context and offers new evaluation protocols for future long-context language models.
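To make the position-sensitivity idea concrete, the following is a minimal sketch of a key-value retrieval probe in which the queried pair is placed at a chosen index among distractor pairs. It is an illustration under stated assumptions, not the authors' implementation: the function names `build_kv_prompt` and `accuracy_by_position`, the prompt wording, the use of random UUID strings as keys and values, and the substring-match scoring are all hypothetical choices. The `model` argument is assumed to be any callable that maps a prompt string to a response string.

```python
import json
import random
import uuid


def build_kv_prompt(num_pairs, gold_index, seed=0):
    """Return (prompt, gold_key, gold_value) with the queried pair at gold_index.

    Hypothetical helper: the prompt wording and UUID-style keys are assumptions.
    """
    if not 0 <= gold_index < num_pairs:
        raise ValueError("gold_index must be in [0, num_pairs)")
    rng = random.Random(seed)
    # Draw keys and values from the seeded RNG so each prompt is reproducible.
    pairs = [
        (str(uuid.UUID(int=rng.getrandbits(128))),
         str(uuid.UUID(int=rng.getrandbits(128))))
        for _ in range(num_pairs)
    ]
    gold_key, gold_value = pairs[gold_index]
    # dict() and json.dumps() preserve insertion order, so the gold pair
    # stays at gold_index inside the serialized context.
    context = json.dumps(dict(pairs), indent=1)
    prompt = (
        "Extract the value corresponding to the specified key "
        "in the JSON object below.\n\n"
        f"{context}\n\n"
        f'Key: "{gold_key}"\nCorresponding value:'
    )
    return prompt, gold_key, gold_value


def accuracy_by_position(model, num_pairs, positions, trials=5):
    """Fraction of trials, per gold position, where the response contains the gold value."""
    results = {}
    for pos in positions:
        correct = 0
        for trial in range(trials):
            prompt, _, gold_value = build_kv_prompt(num_pairs, pos, seed=trial)
            # Assumed scoring rule: simple substring match on the response.
            if gold_value in model(prompt):
                correct += 1
        results[pos] = correct / trials
    return results
```

Calling `accuracy_by_position(model, num_pairs=75, positions=[0, 37, 74])` would compare accuracy when the queried pair sits at the start, middle, and end of the context; the specific numbers here are only example settings. Holding the number of pairs fixed while moving only the gold index isolates the effect of position from the effect of context length.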