Long-context understanding has emerged as a critical capability for large language models (LLMs). However, evaluating this ability remains challenging. We present SCALAR, a benchmark designed to assess citation-grounded long-context reasoning in academic writing. SCALAR leverages academic papers and their citation structure to automatically generate high-quality ground-truth labels without human annotation. It features controllable difficulty levels and a dynamic updating mechanism that mitigates data contamination. The benchmark comprises two tasks: multiple-choice QA and cloze-style citation prediction. We evaluate a range of state-of-the-art LLMs and find that the multiple-choice task effectively distinguishes model capabilities: while human experts achieve over 90% accuracy, most models struggle. The cloze-style task is even more challenging, with no model exceeding 50% accuracy. SCALAR provides a domain-grounded, continuously updating framework for tracking progress in citation-based long-context understanding.
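To make the label-generation idea concrete, below is a minimal sketch, not SCALAR's actual pipeline, of how a cloze-style citation-prediction item could be derived automatically from a paper's citation structure: an in-text citation marker is masked, and the cited reference serves as the gold answer among sampled distractors. All names here (`Paper`, `build_cloze_item`, the distractor-sampling strategy) are hypothetical illustrations, not details from the paper.

```python
# Hypothetical sketch of citation-grounded item construction (assumed, not SCALAR's code).
import random
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    abstract: str


def build_cloze_item(context: str, marker: str, cited: Paper,
                     reference_pool: list[Paper], n_distractors: int = 3) -> dict:
    """Mask one citation marker and build a candidate set with a known gold label.

    context        : passage from the citing paper that contains `marker`, e.g. "[12]".
    cited          : the paper the marker actually refers to (ground truth from the
                     citation graph, so no human annotation is needed).
    reference_pool : other papers from which distractors are sampled; difficulty
                     could be controlled by how similar the distractors are to `cited`.
    """
    question = context.replace(marker, "[CITATION]", 1)
    distractors = random.sample(
        [p for p in reference_pool if p.title != cited.title], n_distractors
    )
    options = distractors + [cited]
    random.shuffle(options)
    return {
        "question": question,
        "options": [p.title for p in options],
        "answer": options.index(cited),  # label comes directly from the citation link
    }
```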