While recent language models can take long contexts as input, relatively little is known about how well they use longer context. We analyze language model performance on two tasks that require identifying relevant information within their input contexts: multi-document question answering and key-value retrieval. We find that performance is often highest when relevant information occurs at the beginning or end of the input context, and degrades significantly when models must access relevant information in the middle of long contexts. Furthermore, performance decreases substantially as the input context grows longer, even for explicitly long-context models. Our analysis offers a better understanding of how language models use their input context and provides new evaluation protocols for future long-context models.
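To make the key-value retrieval setting concrete, the following is a minimal sketch of how such a probe could be built: a JSON object of random key-value pairs in which the queried pair sits at a chosen position, plus a loop that measures accuracy per position. The function names, the prompt wording, and the `answer_fn` callable (a stand-in for a call to a language model) are all hypothetical and are not taken from the authors' code.

```python
import json
import random
import uuid


def make_kv_example(num_pairs, target_index, seed=0):
    """Build one synthetic key-value retrieval example (illustrative only).

    Returns (prompt, target_key, target_value). The queried pair is the
    one at position `target_index` in the serialized JSON object.
    """
    if not 0 <= target_index < num_pairs:
        raise ValueError("target_index must be in [0, num_pairs)")
    rng = random.Random(seed)
    # Random UUID-shaped keys and values, seeded for reproducibility.
    pairs = [
        (str(uuid.UUID(int=rng.getrandbits(128))),
         str(uuid.UUID(int=rng.getrandbits(128))))
        for _ in range(num_pairs)
    ]
    target_key, target_value = pairs[target_index]
    # dict and json.dumps both preserve insertion order, so the target
    # stays at the requested position in the context.
    context = json.dumps(dict(pairs), indent=1)
    # Assumed prompt wording; any fixed template would serve.
    prompt = (
        "Extract the value corresponding to the specified key "
        "in the JSON object below.\n\n"
        f"{context}\n\n"
        f'Key: "{target_key}"\nCorresponding value:'
    )
    return prompt, target_key, target_value


def accuracy_by_position(answer_fn, num_pairs, positions, trials=5):
    """Fraction of trials in which the answer contains the expected value.

    `answer_fn` is a hypothetical callable mapping a prompt string to the
    model's answer string.
    """
    results = {}
    for pos in positions:
        correct = 0
        for trial in range(trials):
            prompt, _, value = make_kv_example(num_pairs, pos, seed=trial)
            if value in answer_fn(prompt):
                correct += 1
        results[pos] = correct / trials
    return results
```

Sweeping `positions` from the first to the last index while holding `num_pairs` fixed gives an accuracy-versus-position curve, and repeating the sweep with a larger `num_pairs` probes the effect of context length.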