L-Eval: Instituting Standardized Evaluation for Long Context Language Models

Recently, there has been growing interest in extending the context length of instruction-following models in order to effectively process single-turn long input (e.g. summarizing a paper) and conversations with more extensive histories. While proprietary models such as GPT-4 and Claude have made significant strides in handling extremely lengthy input, open-source models are still in the early stages of experimentation. It also remains unclear whether extending the context offers substantial gains over traditional methods such as retrieval, and to what extent long-context models improve upon their regular counterparts in practical downstream tasks. To address this challenge, we propose instituting standardized evaluation for long context language models. Concretely, we develop L-Eval, which contains 411 long documents and over 2,000 human-labeled query-response pairs encompassing areas such as law, finance, school lectures, lengthy conversations, news, long-form novels, and meetings. L-Eval also adopts diverse evaluation methods and instruction styles, enabling a more reliable assessment of Long Context Language Models (LCLMs). Our findings indicate that while open-source models typically lag behind commercial models, they still exhibit impressive performance compared with their regular versions. LLaMA2-13B achieves the best results on both open-ended tasks (a 42% win rate against turbo-16k-0613) and closed-ended tasks with only 4k context length. We release our new evaluation suite, code, and all generation results, including predictions from all open-source LCLMs, GPT4-32k, and Claude-100k, at https://github.com/OpenLMLab/LEval.

