Processing and reasoning over long contexts is crucial for many practical applications of Large Language Models (LLMs), such as document comprehension and agent construction. Despite recent strides in enabling LLMs to process contexts of more than 100K tokens, there is currently no standardized benchmark for evaluating this long-context capability. Existing public benchmarks typically focus on contexts of around 10K tokens, which limits the assessment and comparison of LLMs on longer contexts. In this paper, we propose $\infty$Bench, the first LLM benchmark featuring an average data length surpassing 100K tokens. $\infty$Bench comprises synthetic and realistic tasks spanning diverse domains, presented in both English and Chinese. The tasks in $\infty$Bench are designed to require a thorough understanding of long-range dependencies in the context, so that simply retrieving a limited number of passages from the context is not sufficient to solve them. In our experiments based on $\infty$Bench, we evaluate state-of-the-art proprietary and open-source LLMs tailored for processing long contexts. The results indicate that existing long-context LLMs still require significant advancements to effectively process contexts of 100K+ tokens. We further present three intriguing analyses of the behavior of LLMs when processing long contexts.