Existing evaluations of agents with memory typically assess memorization and action in isolation. One class of benchmarks evaluates memorization by testing recall of past conversations or text but fails to capture how memory is used to guide future decisions. Another class focuses on agents acting in single-session tasks without the need for long-term memory. However, in realistic settings, memorization and action are tightly coupled: agents acquire memory while interacting with the environment, and subsequently rely on that memory to solve future tasks. To capture this setting, we introduce MemoryArena, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops. The benchmark consists of human-crafted agentic tasks with explicitly interdependent subtasks, where agents must learn from earlier actions and feedback by distilling experiences into memory, and subsequently use that memory to guide later actions to solve the overall task. MemoryArena supports evaluation across web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning, and reveals that agents with near-saturated performance on existing long-context memory benchmarks like LoCoMo perform poorly in our agentic setting, exposing a gap in current evaluations for agents with memory.
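To make the multi-session Memory-Agent-Environment loop concrete, the following is a minimal toy sketch, not the MemoryArena implementation or API. All names (`Subtask`, `Memory`, `run_task`) and the dinner-planning task are hypothetical and invented for illustration. It assumes that each subtask runs in a fresh session, that only the memory store carries over between sessions, and that a later subtask can be solved only if the fact it depends on was distilled from earlier feedback.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Subtask:
    """Hypothetical subtask in an interdependent chain."""
    name: str
    requires: Optional[str]   # memory key needed to succeed (None = no dependency)
    feedback: Dict[str, str]  # facts the environment reveals after the agent acts


@dataclass
class Memory:
    """Hypothetical memory: a key-value store of distilled experience."""
    store: Dict[str, str] = field(default_factory=dict)

    def distill(self, feedback: Dict[str, str]) -> None:
        # Keep the facts learned from environment feedback.
        self.store.update(feedback)

    def recall(self, key: str) -> Optional[str]:
        return self.store.get(key)


def run_task(subtasks: List[Subtask], use_memory: bool) -> List[bool]:
    """Run subtasks as separate sessions; only `memory` persists across them."""
    memory = Memory()
    results = []
    for subtask in subtasks:
        if subtask.requires is None:
            solved = True
        else:
            # The subtask is solvable only if an earlier session stored the fact.
            solved = memory.recall(subtask.requires) is not None
        results.append(solved)
        if use_memory:
            memory.distill(subtask.feedback)
    return results


# Invented example in the spirit of preference-constrained planning.
TASK = [
    Subtask("ask for dietary preference", None, {"diet": "vegetarian"}),
    Subtask("plan a dinner menu", "diet", {"menu": "mushroom risotto"}),
    Subtask("order groceries for the menu", "menu", {}),
]

with_memory = run_task(TASK, use_memory=True)
without_memory = run_task(TASK, use_memory=False)
print(with_memory, without_memory)
```

In this toy setting the agent that distills feedback into memory solves every subtask, while the memoryless agent fails each subtask that depends on an earlier session. That is the coupling between memorization and action that recall-only or single-session evaluations do not exercise.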