LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Although large language models (LLMs) demonstrate impressive performance on many language tasks, most can only handle texts a few thousand tokens long, which limits their application to longer inputs such as books, reports, and codebases. Recent work has proposed improving LLMs' long context capabilities by extending context windows and adding more sophisticated memory mechanisms. However, comprehensive benchmarks tailored to evaluating long context understanding are lacking. In this paper, we introduce LongBench, the first bilingual, multi-task benchmark for long context understanding, enabling a more rigorous evaluation of this capability. LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). These tasks cover key long-text application areas: single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion. All datasets in LongBench are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. From a comprehensive evaluation of 8 LLMs on LongBench, we find that: (1) The commercial model (GPT-3.5-Turbo-16k) outperforms the open-source models, but still struggles on longer contexts. (2) Scaled position embeddings and fine-tuning on longer sequences lead to substantial improvements in long context understanding. (3) Context compression techniques such as retrieval improve models with weak long context ability, but their performance still lags behind that of models with strong long context understanding capability. The code and datasets are available at https://github.com/THUDM/LongBench.
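
To make finding (3) concrete, the following is a minimal sketch of retrieval-style context compression: the long context is split into fixed-size chunks, each chunk is scored against the query, and only the top-scoring chunks are kept in their original order. The abstract does not specify a retriever, chunk size, or scoring function, so the word-overlap score, the function names (`split_into_chunks`, `overlap_score`, `compress_context`), and the default parameters are all illustrative assumptions, not the setup used in the paper.

```python
from collections import Counter


def split_into_chunks(text: str, chunk_words: int) -> list[str]:
    """Split text into consecutive chunks of at most `chunk_words` words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]


def overlap_score(query: str, chunk: str) -> int:
    """Count word occurrences shared by query and chunk (case-insensitive).

    Assumption: plain lexical overlap stands in for a real retriever.
    """
    query_counts = Counter(query.lower().split())
    chunk_counts = Counter(chunk.lower().split())
    return sum((query_counts & chunk_counts).values())


def compress_context(
    context: str, query: str, chunk_words: int = 200, top_k: int = 3
) -> str:
    """Keep the `top_k` chunks most similar to the query, in document order."""
    chunks = split_into_chunks(context, chunk_words)
    # Rank chunk indices by descending score; ties favour earlier chunks
    # because Python's sort is stable.
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: -overlap_score(query, chunks[i]),
    )
    # Restore original document order among the selected chunks.
    kept = sorted(ranked[:top_k])
    return "\n".join(chunks[i] for i in kept)


if __name__ == "__main__":
    context = "the sky is blue grass is very green snow is often white"
    query = "what color is grass"
    print(compress_context(context, query, chunk_words=4, top_k=1))
```

The compressed string would then be passed to the model in place of the full context. Keeping the selected chunks in document order is a design choice made here for readability of the shortened input; other orderings are equally plausible.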
