Despite recent efforts to develop large language models (LLMs) with robust long-context capabilities, the lack of long-context benchmarks means that relatively little is known about how well they actually perform. To close this gap, we propose \textbf{Counting-Stars}, a multi-evidence, position-aware, and scalable benchmark designed to evaluate the multi-evidence retrieval capabilities of long-context LLMs. \textbf{Counting-Stars} comprises two counting-based multi-evidence retrieval tasks: searching and reasoning. Using Counting-Stars, we conducted experiments evaluating several long-context LLMs, including GPT-4 Turbo, Gemini 1.5 Pro, Claude 3 Opus, GLM-4, and Moonshot-v1. Extensive experimental results show that Gemini 1.5 Pro achieves the best overall results, while GPT-4 Turbo exhibits the most stable performance across tasks. Furthermore, our analysis of these LLMs, which have been extended to handle long-context scenarios, indicates that substantial room for improvement remains as the input context grows longer and the tasks become more complex.