A prerequisite for any large language model (LLM) to answer questions about a long document is the ability to locate the content relevant to the user's query. We present NeedleBench, a framework of progressively more challenging tasks for assessing bilingual long-context capabilities. It spans multiple length intervals (4k, 8k, 32k, 128k, 200k, 1000k, and beyond) and different depth ranges, so that critical data points can be inserted at chosen depths in the text to rigorously test the retrieval and reasoning capabilities of models in diverse contexts. We use NeedleBench to assess how well leading open-source models identify key information relevant to a question and apply it to reasoning over bilingual long texts. Furthermore, we propose the Ancestral Trace Challenge (ATC), which mimics the complexity of logical reasoning challenges likely to arise in real-world long-context tasks and provides a simple method for evaluating LLMs in complex long-context situations. Our results suggest that current LLMs have significant room for improvement in practical long-context applications, as they struggle with such reasoning challenges. All code and resources are available at OpenCompass: https://github.com/open-compass/opencompass.
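The idea of inserting a critical data point at a chosen depth within a context of a chosen length can be illustrated with a minimal sketch. This is not the OpenCompass implementation: the function names `insert_needle` and `build_grid` are hypothetical, lengths are counted in whitespace-separated words rather than model tokens, and the depth is assumed to be a percentage from 0 (start of the text) to 100 (end of the text).

```python
from typing import Dict, List, Sequence, Tuple


def insert_needle(haystack_words: Sequence[str], needle: str,
                  depth_percent: float, context_len: int) -> List[str]:
    """Place `needle` at `depth_percent` of a context of `context_len` words.

    Assumption: length is measured in whitespace-separated words, a stand-in
    for the tokenizer-based lengths a real benchmark would use.
    """
    if not 0 <= depth_percent <= 100:
        raise ValueError("depth_percent must be between 0 and 100")
    needle_words = needle.split()
    budget = context_len - len(needle_words)  # words left for filler text
    if budget < 0:
        raise ValueError("needle is longer than the requested context")
    filler = list(haystack_words[:budget])
    cut = round(len(filler) * depth_percent / 100)  # insertion index
    return filler[:cut] + needle_words + filler[cut:]


def build_grid(haystack_words: Sequence[str], needle: str,
               context_lens: Sequence[int],
               depths: Sequence[float]) -> Dict[Tuple[int, float], str]:
    """Build one prompt body per (context length, depth) combination."""
    return {
        (n, d): " ".join(insert_needle(haystack_words, needle, d, n))
        for n in context_lens
        for d in depths
    }


if __name__ == "__main__":
    filler_text = ["filler"] * 1000            # placeholder long document
    fact = "The hidden code is 7421."          # made-up needle for illustration
    grid = build_grid(filler_text, fact, [50, 200], [0, 50, 100])
    print(sorted(grid))                         # the (length, depth) test cells
```

Each cell of the resulting grid would be followed by a question about the needle, and a model's answers across cells show how retrieval varies with context length and insertion depth.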