Large language models (LLMs) have achieved remarkable proficiency on NLP tasks of normal length. Recently, multiple studies have focused on extending the context length and enhancing the long text modeling capabilities of LLMs. To comprehensively evaluate the long context ability of LLMs, we propose BAMBOO, a multi-task long context benchmark. BAMBOO is designed around four principles: comprehensive capacity evaluation, avoidance of data contamination, accurate automatic evaluation, and different length levels. It consists of 10 datasets from 5 long text understanding tasks (question answering, hallucination detection, text sorting, language modeling, and code completion), covering core capacities of LLMs and various domains. We conduct experiments with five long context models on BAMBOO and further discuss four key research questions concerning long text. We also qualitatively analyze current long context models and point out future directions for enhancing long text modeling capacities. We release our data, prompts, and code at https://github.com/RUCAIBox/BAMBOO.