Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, written by human readers of those books. In contrast to existing long-context benchmarks, our annotators confirm that the largest share of pairs in NoCha requires global reasoning over the entire book to verify. Our experiments show that while human readers easily perform this task, it is enormously challenging for all ten long-context LLMs that we evaluate: no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks), while GPT-4o achieves the highest accuracy at 55.8%. Further analysis reveals that (1) on average, models perform much better on pairs that require only sentence-level retrieval than on those requiring global reasoning; (2) model-generated explanations for their decisions are often inaccurate even for correctly-labeled claims; and (3) models perform substantially worse on speculative fiction books that contain extensive world-building. The methodology behind NoCha allows the benchmark dataset to evolve over time and makes it easy to analyze future models.
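The minimal-pair design suggests a natural scoring rule: credit a model for a pair only when it labels both the true and the false claim correctly, which pushes the random-chance floor on pairs down toward 25% even though single-claim guessing sits near 50%. A minimal sketch of such pairwise scoring (the exact scoring protocol used by the benchmark is an assumption here, not stated in the abstract):

```python
def pair_accuracy(predictions):
    """Fraction of claim pairs labeled fully correctly.

    predictions: list of (pred_for_true_claim, pred_for_false_claim)
    booleans, where the gold labels are always (True, False).
    A pair counts only if BOTH claims are labeled correctly.
    """
    correct = sum(1 for p_true, p_false in predictions
                  if p_true and not p_false)
    return correct / len(predictions)


# Hypothetical example: the model labels 2 of 3 pairs fully correctly;
# in the middle pair it accepts the false claim, so the pair is wrong.
preds = [(True, False), (True, True), (True, False)]
print(f"pair accuracy: {pair_accuracy(preds):.3f}")
```

Under this rule, a model that guesses independently on each claim gets both labels in a pair right only about a quarter of the time, which is why pairwise scoring is stricter than per-claim accuracy.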