LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. Prior benchmarks evaluate only single-entity updates. MEME instead defines six tasks covering the full space spanned by the multi-entity and evolving axes, including three that prior work does not score: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems from three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (average accuracy: 3% on Cascade, 1% on Absence) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent using Claude Opus 4.7 as its internal LLM partially closes it, but at ~70x the baseline cost, indicating that closing the gap currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.