Long-context LLM agents must access the right evidence from large environments and use it faithfully. However, the popular Needle-in-a-Haystack (NIAH) evaluation mostly measures benign span localization: the needle is near-unique and the haystack is largely irrelevant. We introduce EverMemBench-S (EMB-S), an adversarial NIAH-style benchmark built on a 326M-token MemoryBank. The full MemoryBank is used for retrieval-based (RAG) evaluation, while native long-context models are evaluated only at scales that fit within each model's context window (up to 1M tokens in this work) to ensure a fair comparison. EMB-S pairs queries with collision-tested near-miss hard negatives and gold evidence sets spanning one or more documents, validated via human screening and LLM verification. We also propose a decoupled diagnostic protocol that reports evidence access (document-ID localization) separately from end-to-end QA quality under full-context prompting, enabling consistent diagnosis of both native long-context prompting and retrieval pipelines. Across a reference-corpus ladder from domain-isolated 64K contexts to a globally shared 326M-token environment, we observe a clear reality gap: systems that saturate benign NIAH degrade sharply in evidence access under semantic interference. These results indicate that semantic discrimination, not context length alone, is the dominant bottleneck for long-context memory at scale.
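To make the decoupled protocol concrete, the sketch below scores evidence access (document-ID localization) and end-to-end QA quality as two independent metrics. It is a minimal illustration only, not EMB-S's released evaluation harness; the field names (`gold_doc_ids`, `predicted_doc_ids`, `answer_correct`) are assumptions introduced for the example.

```python
# Minimal sketch of the decoupled diagnostic protocol: evidence access is
# reported separately from end-to-end QA quality. Field names are
# illustrative assumptions, not the benchmark's actual schema.
from dataclasses import dataclass

@dataclass
class Example:
    gold_doc_ids: set[str]       # gold evidence set; may span one or more documents
    predicted_doc_ids: set[str]  # document IDs the system localized or retrieved
    answer_correct: bool         # end-to-end QA judgment (e.g., LLM- or human-verified)

def evaluate(examples: list[Example]) -> dict[str, float]:
    """Report evidence access and QA quality as two separate numbers."""
    n = len(examples)
    # Evidence access: the system must surface every gold document for the query.
    access_hits = sum(ex.gold_doc_ids <= ex.predicted_doc_ids for ex in examples)
    # QA quality: judged on the final answer, independently of localization.
    qa_hits = sum(ex.answer_correct for ex in examples)
    return {"evidence_access": access_hits / n, "qa_accuracy": qa_hits / n}
```

Keeping the two scores separate is what lets the same diagnosis apply to native long-context prompting (where "retrieval" is implicit attention over the full context) and to explicit RAG pipelines.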