Many benchmarks already exist for evaluating the long context understanding and reasoning capabilities of large language models, but as the context windows of these models expand, existing long context benchmarks are no longer sufficient for this purpose. In this paper, we develop a new long context evaluation benchmark, named Marathon, which takes the form of multiple choice questions and is inspired by benchmarks such as MMLU. It is designed to assess the long context comprehension capability of large language models quickly, accurately, and objectively. We evaluate several of the latest and most popular large language models, as well as three recent and effective long context optimization methods, on our benchmark. The results showcase the long context reasoning and comprehension capabilities of these models and validate the effectiveness of these optimization methods. Marathon is available at https://huggingface.co/datasets/Lemoncoke/Marathon.
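To illustrate why a multiple choice format allows quick and objective scoring, the following is a minimal sketch of accuracy computation by exact match on the selected option letter. It is not the Marathon evaluation code: the helper names `extract_choice` and `accuracy`, the four option labels A to D, and the simple answer-extraction heuristic are all assumptions made for illustration.

```python
import re
from typing import Iterable, Optional


def extract_choice(response: str, labels: str = "ABCD") -> Optional[str]:
    """Return the first standalone uppercase option letter in a response.

    Hypothetical heuristic: a real evaluation would likely need stricter
    parsing, since a sentence starting with the article "A" would be
    misread as option A.
    """
    match = re.search(rf"\b([{labels}])\b", response)
    return match.group(1) if match else None


def accuracy(responses: Iterable[str], answers: Iterable[str],
             labels: str = "ABCD") -> float:
    """Fraction of responses whose extracted option equals the gold answer."""
    pairs = list(zip(responses, answers))
    if not pairs:
        return 0.0
    # A response with no recognizable option letter counts as incorrect.
    correct = sum(extract_choice(r, labels) == a for r, a in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    model_outputs = ["B", "The answer is C.", "I am not sure"]
    gold_answers = ["B", "C", "A"]
    print(f"accuracy = {accuracy(model_outputs, gold_answers):.3f}")
```

Because each question has a single gold option, scoring reduces to a string comparison and requires no human or model-based judgment of free-form answers.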