Recently, there has been growing interest in extending the context length of instruction-following models so that they can effectively process long single-turn inputs (e.g., summarizing a paper) and conversations with more extensive histories. While proprietary models such as GPT-4 and Claude have made significant strides in handling extremely long inputs, open-source models are still in the early stages of experimentation. It also remains unclear whether extending the context offers substantial gains over traditional methods such as retrieval, and to what extent long-context models improve upon their regular counterparts in practical downstream tasks. To address this challenge, we propose instituting standardized evaluation for long context language models. Concretely, we develop L-Eval, which contains 411 long documents and over 2,000 human-labeled query-response pairs encompassing areas such as law, finance, school lectures, lengthy conversations, news, long-form novels, and meetings. L-Eval also adopts diverse evaluation methods and instruction styles, enabling a more reliable assessment of Long Context Language Models (LCLMs). Our findings indicate that while open-source models typically lag behind commercial models, they still exhibit impressive performance compared with their regular versions. LLaMA2-13B achieves the best results on both open-ended tasks (\textbf{42}\% win rate vs turbo-16k-0613) and closed-ended tasks with only 4k context length. We release our new evaluation suite, code, and all generation results, including predictions from all open-source LCLMs, GPT4-32k, and Claude-100k, at {\url{https://github.com/OpenLMLab/LEval}}.