Current long-context benchmarks primarily focus on retrieval-based tests, requiring Large Language Models (LLMs) to locate specific information within extensive input contexts, such as the needle-in-a-haystack (NIAH) benchmark. Long-context generation refers to the ability of a language model to generate coherent and contextually accurate text spanning lengthy passages or documents. While recent studies show strong performance on NIAH and other retrieval-based long-context benchmarks, there is a significant lack of benchmarks for evaluating long-context generation capabilities. To bridge this gap and offer a comprehensive assessment, we introduce a synthetic benchmark, LongGenBench, which allows flexible configuration of customized generation context lengths. LongGenBench advances beyond traditional benchmarks by redesigning the question format and requiring that LLMs respond with a single, cohesive long-context answer. Upon extensive evaluation using LongGenBench, we observe that: (1) both API-accessed and open-source models exhibit performance degradation in long-context generation scenarios, ranging from 1.2% to 47.1%; (2) different series of LLMs exhibit varying trends of performance degradation, with the Gemini-1.5-Flash model showing the least degradation among API-accessed models, and the Qwen2 series exhibiting the least degradation on LongGenBench among open-source models.