Large language models (LLMs), despite their impressive performance on various language tasks, are typically limited to processing texts within their context window. This limitation has spurred significant research efforts to enhance LLMs' long-context understanding, which in turn calls for high-quality long-sequence benchmarks. However, prior datasets in this regard suffer from shortcomings, such as short context lengths relative to the context windows of modern LLMs; outdated documents that pose data leakage problems; and an emphasis on short dependency tasks rather than long dependency tasks. In this paper, we present LooGLE, a Long Context Generic Language Evaluation benchmark for LLMs' long context understanding. LooGLE features relatively new documents from after 2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains. Human annotators meticulously crafted more than 1,100 high-quality question-answer pairs to meet the long dependency requirements. These pairs underwent thorough cross-validation, yielding the most precise assessment of LLMs' long dependency capabilities. The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings: (i) commercial models outperformed open-source models; (ii) LLMs excelled at short dependency tasks such as short question-answering and cloze tasks but struggled with more intricate long dependency tasks; (iii) in-context learning and chain-of-thought prompting offered only marginal improvements; (iv) retrieval-based techniques provided substantial benefits for short question-answering, whereas strategies for extending the context window length had limited impact on long context understanding. As such, LooGLE not only provides a systematic and comprehensive evaluation schema for long-context LLMs, but also sheds light on the future development of enhanced models toward "true long-context understanding".