CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning

While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are mostly limited to single long texts or rely on a "sparse retrieval" assumption, namely that answers can be derived from a few relevant chunks. This assumption fails for true corpus-level analysis, where evidence is highly dispersed across hundreds of documents and answers require global integration, comparison, and statistical aggregation. To address this gap, we introduce CorpusQA, a new benchmark scaling up to 10 million tokens, generated via a novel data synthesis framework. By decoupling reasoning from textual representation, this framework creates complex, computation-intensive queries with programmatically guaranteed ground-truth answers, challenging systems to perform holistic reasoning over vast, unstructured text without relying on fallible human annotation. We further demonstrate the utility of our framework beyond evaluation, showing that fine-tuning on our synthesized data effectively enhances an LLM's general long-context reasoning capabilities. Extensive experiments reveal that even state-of-the-art long-context LLMs struggle as input length increases, and standard retrieval-augmented generation systems collapse entirely. Our findings indicate that memory-augmented agentic architectures offer a more robust alternative, suggesting that a shift is needed from simply extending context windows to developing advanced architectures for global information synthesis.
