The rapid advancement of Large Language Models (LLMs) has opened a new frontier in natural language processing, particularly in understanding and processing long-context information. However, evaluating these models' long-context abilities remains a challenge because of the limitations of current benchmarks. To address this gap, we introduce NovelQA, a benchmark specifically designed to test the capabilities of LLMs on extended texts. Constructed from English novels, NovelQA offers a unique blend of complexity, length, and narrative coherence, making it an ideal tool for assessing deep textual understanding in LLMs. This paper presents the design and construction of NovelQA, highlighting its manual annotation and diverse question types. Our evaluation of long-context LLMs on NovelQA reveals significant insights into the models' performance, particularly the challenges they face with multi-hop reasoning, detail-oriented questions, and extremely long inputs averaging more than 200,000 tokens. The results underscore the need for further advances in LLMs to improve their long-context comprehension.