LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

Long-term memory is crucial for agents operating in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open the question of how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Questions are paired with histories of up to 500 trajectories and 115M tokens. We adopt a context-gathering formulation: memory systems consume history trajectories and return compact evidence for downstream question answering. We propose two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes; and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance at 72.5% average accuracy, outperforming both the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite these gains, coding-agent-based methods incur high latency. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.
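The context-gathering formulation can be illustrated with a minimal sketch: a memory system first consumes history trajectories, then returns a small set of evidence snippets for a question, which a downstream answering model would read. Every name below (`MemorySystem`, `KeywordMemory`, `ingest`, `gather`, `budget`) is hypothetical and is not taken from LME-V2 or AgentRunbook; the keyword-overlap scoring is a toy stand-in for a real retriever, and trajectories are assumed to be plain text with one step per line.

```python
from dataclasses import dataclass, field
from typing import Protocol


class MemorySystem(Protocol):
    """Hypothetical interface: consume trajectories, return compact evidence."""

    def ingest(self, trajectories: list[str]) -> None: ...

    def gather(self, question: str, budget: int) -> list[str]: ...


def _tokens(text: str) -> set[str]:
    # Naive whitespace tokenization; a real system would use a proper retriever.
    return set(text.lower().split())


@dataclass
class KeywordMemory:
    """Toy memory that stores each trajectory step and ranks steps by word overlap."""

    steps: list[str] = field(default_factory=list)

    def ingest(self, trajectories: list[str]) -> None:
        # Assumption: each trajectory is plain text with one step per line.
        for trajectory in trajectories:
            for line in trajectory.splitlines():
                if line.strip():
                    self.steps.append(line.strip())

    def gather(self, question: str, budget: int = 2) -> list[str]:
        # Score every stored step by the number of words shared with the question.
        query = _tokens(question)
        scored = [(len(query & _tokens(step)), i, step) for i, step in enumerate(self.steps)]
        # Keep only steps with some overlap; break ties by original order.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        return [step for _, _, step in ranked[:budget]]


memory = KeywordMemory()
memory.ingest([
    "open settings page\nclick export button to download report",
    "search orders by customer name",
])
evidence = memory.gather("how to download the report", budget=1)
print(evidence)
```

The `budget` parameter reflects the requirement that the returned evidence be compact: the downstream question-answering step sees only the gathered snippets, not the full history.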
