Large language models (LLMs) have achieved remarkable proficiency on NLP tasks of normal length. Recently, multiple studies have focused on extending the context length and enhancing the long text modeling capabilities of LLMs. To comprehensively evaluate the long context ability of LLMs, we propose BAMBOO, a multi-task long context benchmark. BAMBOO is designed around four principles: comprehensive capacity evaluation, avoidance of data contamination, accurate automatic evaluation, and different length levels. It consists of 10 datasets from 5 different long text understanding tasks, namely question answering, hallucination detection, text sorting, language modeling, and code completion, covering the core capacities of LLMs across various domains. We conduct experiments with five long context models on BAMBOO and further discuss four key research questions about long text. We also qualitatively analyze current long context models and point out future directions for enhancing long text modeling capacities. We release our data, prompts, and code at https://github.com/RUCAIBox/BAMBOO.