MEME: Multi-entity & Evolving Memory Evaluation

LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. Whereas prior benchmarks evaluate only single-entity updates, MEME defines six tasks that span the full space of the multi-entity and evolving axes, including three that prior work does not score: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems from three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (average accuracy of 3% on Cascade and 1% on Absence) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent using Claude Opus 4.7 as its internal LLM partially closes it, but at roughly 70x the baseline cost, which indicates that closing the gap currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.
