Evaluating the long-context understanding capabilities of large language models (LLMs) remains challenging. We present SCALAR (Scientific Citation-based Live Assessment of Long-context Academic Reasoning), a novel benchmark that leverages academic papers and their citation networks. SCALAR features automatic generation of high-quality ground-truth labels without human annotation, controllable difficulty levels, and a dynamic updating mechanism that prevents data contamination. Using ICLR 2025 papers, we evaluate 8 state-of-the-art LLMs, revealing key insights about their capabilities and limitations in processing long scientific documents across different context lengths and reasoning types. Our benchmark provides a reliable and sustainable way to track progress in long-context understanding as LLM capabilities evolve.