Current long-context benchmarks primarily focus on retrieval-based tests, requiring Large Language Models (LLMs) to locate specific information within extensive input contexts, such as the needle-in-a-haystack (NIAH) benchmark. Long-context generation refers to the ability of a language model to generate coherent and contextually accurate text spanning lengthy passages or documents. While recent studies show strong performance on NIAH and other retrieval-based long-context benchmarks, there is a significant lack of benchmarks for evaluating long-context generation capabilities. To bridge this gap and offer a comprehensive assessment, we introduce a synthetic benchmark, LongGenBench, which allows flexible configuration of customized generation context lengths. LongGenBench advances beyond traditional benchmarks by redesigning the question format and requiring that LLMs respond with a single, cohesive long-context answer. Upon extensive evaluation using LongGenBench, we observe that: (1) both API-accessed and open-source models exhibit performance degradation in long-context generation scenarios, ranging from 1.2% to 47.1%; (2) different series of LLMs exhibit varying trends of performance degradation, with the Gemini-1.5-Flash model showing the least degradation among API-accessed models, and the Qwen2 series exhibiting the least degradation on LongGenBench among open-source models.