Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability remains under-evaluated, largely because existing benchmarks do not provide a systematic way to assess memory mechanisms. In this paper, we study agent memory from a self-evolving perspective and introduce EvoMemBench, a unified benchmark organized along two axes: memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented). We compare 15 representative memory methods with strong long-context baselines under a standardized protocol. Results show that current memory systems are still far from a general solution: long-context baselines remain highly competitive, memory helps most when the current context is insufficient or tasks are difficult, and no single memory form works consistently across all settings. Retrieval-based methods remain strong for knowledge-intensive settings, whereas procedural and long-term memory methods are more effective for execution-oriented tasks when their stored experience matches the task structure. We hope EvoMemBench facilitates future research on more effective memory systems for LLM-based agents. Our code is available at https://github.com/DSAIL-Memory/EvoMemBench.