Recently, there has been growing interest in extending the context length of instruction-following models so that they can effectively process long single-turn inputs (e.g., summarizing a paper) and conversations with more extensive histories. While proprietary models such as GPT-4 and Claude have demonstrated considerable advances in handling tens of thousands of tokens of context, open-source models are still in the early stages of experimentation. It also remains unclear whether developing these long context models offers substantial gains on practical downstream tasks over retrieval-based methods or models simply trained on chunked contexts. To address this challenge, we propose to institute standardized evaluation for long context language models. Concretely, we develop L-Eval, which contains 411 long documents and over 2,000 query-response pairs, manually annotated and checked by the authors, encompassing areas such as law, finance, school lectures, lengthy conversations, news, long-form novels, and meetings. L-Eval also adopts diverse evaluation methods and instruction styles, enabling a more reliable assessment of Long Context Language Models (LCLMs). Our findings indicate that while open-source models typically lag behind their commercial counterparts, they still exhibit impressive performance. LLaMA2 achieves the best results on open-ended tasks (a 45\% win rate against turbo-16k) with only a 4k context length, and ChatGLM2 achieves the best results on closed-ended tasks with 8k input tokens. We release our new evaluation suite, code, and all generation results, including predictions from all open-source LCLMs, GPT4-32k, and Claude-100k, at {\url{https://github.com/OpenLMLab/LEval}}.