The long-context capabilities of large language models (LLMs) have been a hot topic in recent years. To evaluate the performance of LLMs in different scenarios, various assessment benchmarks have emerged. However, because most of these benchmarks focus on identifying the key information needed to answer questions, which mainly exercises the retrieval ability of LLMs, they only partially reflect how well LLMs can reason over large amounts of information. Meanwhile, although LLMs often claim context windows of 32k, 128k, 200k, or even longer, these benchmarks fail to reveal the context length that these models actually support. To address these issues, we propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs built from existing instruction datasets. Specifically, LongIns introduces three evaluation settings: Global Instruction & Single Task (GIST), Local Instruction & Single Task (LIST), and Local Instruction & Multiple Tasks (LIMT). Based on LongIns, we perform comprehensive evaluations of existing LLMs and report the following important findings: (1) the top-performing GPT-4, despite its claimed 128k context length, performs poorly at an evaluation context window of 16k in our LongIns; (2) for many existing LLMs, multi-hop reasoning ability still requires significant improvement even under short context windows (less than 4k).
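To make the three settings concrete, the following is a minimal sketch of how evaluation prompts could be assembled from instruction-tuning examples under GIST, LIST, and LIMT. The `Example` fields and the three `*_prompt` helpers are illustrative assumptions, not the actual data format or construction code used by LongIns.

```python
# Hypothetical sketch of the three LongIns prompt-construction settings.
# Field names and helper functions are assumptions for illustration only.
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    instruction: str  # task instruction, e.g. "Classify the sentiment ..."
    input: str        # task input segment
    answer: str       # gold answer (held out from the prompt)


def gist_prompt(examples: List[Example]) -> str:
    """Global Instruction & Single Task: one shared instruction at the top,
    followed by many input segments drawn from the same task."""
    header = examples[0].instruction
    body = "\n".join(f"[{i}] {ex.input}" for i, ex in enumerate(examples))
    return f"{header}\n{body}"


def list_prompt(examples: List[Example]) -> str:
    """Local Instruction & Single Task: the same task's instruction is
    repeated locally before each input segment."""
    return "\n".join(
        f"{ex.instruction}\n[{i}] {ex.input}" for i, ex in enumerate(examples)
    )


def limt_prompt(examples: List[Example]) -> str:
    """Local Instruction & Multiple Tasks: segments come from different tasks,
    each carrying its own local instruction."""
    return "\n".join(
        f"Task {i}: {ex.instruction}\nInput: {ex.input}"
        for i, ex in enumerate(examples)
    )
```

Under this reading, the evaluated context grows with the number of concatenated segments, so the same model can be probed at 4k, 16k, or longer windows simply by varying how many examples are packed into one prompt.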