Large language models (LLMs), despite their impressive performance on various language tasks, are typically limited to processing texts within their context window. This limitation has spurred significant research efforts to enhance LLMs' long-context understanding, which in turn calls for high-quality long-sequence benchmarks. However, prior datasets in this regard suffer from shortcomings, such as short context lengths relative to the context windows of modern LLMs; outdated documents that pose data leakage problems; and an emphasis on short dependency tasks rather than long dependency tasks. In this paper, we present LooGLE, a Long Context Generic Language Evaluation benchmark for LLMs' long context understanding. LooGLE features relatively new documents from after 2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains. Human annotators meticulously crafted more than 1,100 high-quality question-answer pairs to meet the long dependency requirements. These pairs underwent thorough cross-validation, yielding the most precise assessment of LLMs' long dependency capabilities. The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings: (i) commercial models outperformed open-source models; (ii) LLMs excelled at short dependency tasks such as short question-answering and cloze tasks but struggled with more intricate long dependency tasks; (iii) in-context learning and chain-of-thought prompting offered only marginal improvements; (iv) retrieval-based techniques provided substantial benefits for short question-answering, whereas strategies for extending the context window length had limited impact on long context understanding. As such, LooGLE not only provides a systematic and comprehensive evaluation schema for long-context LLMs, but also sheds light on the future development of enhanced models toward "true long-context understanding".