While recent research endeavors have focused on developing Large Language Models (LLMs) with robust long-context capabilities, relatively little is known about how well long-context LLMs actually perform, owing to the lack of suitable long-context benchmarks. To address this gap, we propose Counting-Stars, a multi-evidence, position-aware, and scalable benchmark that evaluates long-context LLMs through two tasks: multi-evidence acquisition and multi-evidence reasoning. Based on the Counting-Stars test, we conduct experiments to evaluate several long-context LLMs (i.e., GPT-4 Turbo, Gemini 1.5 Pro, Claude 3 Opus, GLM-4, and Moonshot-v1). Experimental results demonstrate that Gemini 1.5 Pro achieves the best overall results, while GPT-4 Turbo exhibits the most stable performance across tasks. Furthermore, our analysis of these LLMs, which have been extended to handle long-context scenarios, indicates that considerable room for improvement remains as the input context lengthens and the tasks grow more intricate.
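To make the "multi-evidence, position-aware" setup concrete, the following is a minimal sketch of how such a test could be constructed: a fixed number of evidence sentences are inserted at controlled positions inside a long filler text, and the model's answer is scored by how many pieces of evidence it recovers. The template sentence, function names, insertion scheme, and scoring here are illustrative assumptions, not the paper's exact protocol.

```python
import random
import re

# Hypothetical evidence template; the actual benchmark's sentence may differ.
EVIDENCE_TEMPLATE = "The little penguin counted {n} stars."

def build_context(haystack: str, num_evidence: int, context_len: int, seed: int = 0):
    """Insert num_evidence evidence sentences at evenly spaced offsets
    inside a truncated haystack; return the context and the ground truth."""
    random.seed(seed)
    filler = haystack[:context_len]
    counts = [random.randint(1, 100) for _ in range(num_evidence)]
    # Evenly spaced insertion points make the test position-aware:
    # evidence appears at the start, middle, and end of the context.
    step = len(filler) // (num_evidence + 1)
    pieces, prev = [], 0
    for i, n in enumerate(counts, start=1):
        pieces.append(filler[prev:i * step])
        pieces.append(" " + EVIDENCE_TEMPLATE.format(n=n) + " ")
        prev = i * step
    pieces.append(filler[prev:])
    return "".join(pieces), counts

def score_acquisition(model_answer: str, truth: list[int]) -> float:
    """Fraction of inserted counts the model reported (multi-evidence acquisition)."""
    found = {int(x) for x in re.findall(r"\d+", model_answer)}
    return sum(n in found for n in truth) / len(truth)
```

Scalability falls out of the two knobs: `context_len` stretches the haystack toward a model's context limit, and `num_evidence` raises task intricacy, so one harness can sweep both axes. A usage sketch: `context, truth = build_context(open("filler.txt").read(), num_evidence=32, context_len=128_000)`, then pass `context` plus an instruction to list every counted number to the model under test and call `score_acquisition` on its reply.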