Large language models (LLMs) have achieved strong proficiency on NLP tasks of normal length. Recently, multiple studies have focused on extending the context length and enhancing the long text modeling capabilities of LLMs. To comprehensively evaluate the long context ability of LLMs, we propose BAMBOO, a multi-task long context benchmark. BAMBOO is designed around four principles: comprehensive capacity evaluation, avoidance of data contamination, accurate automatic evaluation, and different length levels. It consists of 10 datasets from 5 different long text understanding tasks, namely question answering, hallucination detection, text sorting, language modeling, and code completion, to cover the core capacities of LLMs and various domains. We conduct experiments with five long context models on BAMBOO and further discuss four key research questions about long text. We also qualitatively analyze current long context models and point out future directions for enhancing long text modeling capacities. We release our data, prompts, and code at https://github.com/RUCAIBox/BAMBOO.