MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

Real-world agents operate over long, evolving horizons in which information is repeatedly updated and memories can interfere with one another, so agents must both recall individual facts accurately and reason over multiple pieces of information in aggregate. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce MINTEval (Long-Horizon Memory under INTerference Evaluation), a benchmark featuring (1) long, highly interconnected contexts whose frequently updated information induces substantial interference; (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization; and (3) diverse question types that assess robustness to interference, including (i) single-target recall tasks, which require retrieving a specific target from a long context, and (ii) multi-target aggregation tasks, which require reasoning over multiple relevant pieces of information. Overall, MINTEval contains 15.6k question-answer pairs over long-horizon contexts that average 138.8k tokens and extend up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. Across all systems, we observe consistently low performance (avg. 27.9% accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is primarily limited by retrieval and memory construction. Furthermore, current memory systems struggle to recall and reason over earlier facts that are revised or interfered with by subsequent context, with accuracy degrading as the number of intervening updates increases.

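To make the two question types and the notion of intervening updates concrete, the following is a minimal toy sketch, not MINTEval's actual data format or scoring code. The `Update` record, the key/value stream, and all function names are hypothetical and introduced here only for illustration. It assumes a context can be reduced to a sequence of keyed updates, where later updates to a key overwrite earlier ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    """Hypothetical record: one fact asserted at one point in a long context."""
    step: int    # position of the update in the context
    key: str     # entity or attribute being tracked
    value: int   # value asserted at this step


def latest_value(stream, key):
    """Single-target recall: return the most recent value written for `key`."""
    for update in reversed(stream):
        if update.key == key:
            return update.value
    return None


def aggregate_latest(stream, keys):
    """Multi-target aggregation: sum the current values of several keys."""
    values = [latest_value(stream, k) for k in keys]
    return sum(v for v in values if v is not None)


def intervening_updates(stream, key, step):
    """Count updates to `key` that occur after `step` (a proxy for interference)."""
    return sum(1 for u in stream if u.key == key and u.step > step)


def accuracy_by_interference(records):
    """Group (n_intervening_updates, is_correct) pairs and report accuracy per group."""
    buckets = {}
    for n_updates, is_correct in records:
        hits, total = buckets.get(n_updates, (0, 0))
        buckets[n_updates] = (hits + int(is_correct), total + 1)
    return {n: hits / total for n, (hits, total) in sorted(buckets.items())}


stream = [
    Update(0, "apples", 3),
    Update(1, "pears", 5),
    Update(2, "apples", 7),
    Update(3, "apples", 2),
]
current_apples = latest_value(stream, "apples")
total_fruit = aggregate_latest(stream, ["apples", "pears"])
apple_interference = intervening_updates(stream, "apples", 0)
```

In this toy stream, a question about the current apple count has one target, while a question about total fruit must combine the latest values of two keys. Passing per-question results to `accuracy_by_interference` gives the kind of breakdown by number of intervening updates that the abstract refers to.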