We introduce EMemBench, a programmatic benchmark for evaluating the long-term memory of agents through interactive games. Rather than using a fixed set of questions, EMemBench generates questions from each agent's own trajectory, covering both text and visual game environments. Each question template computes verifiable ground truth from underlying game signals, with controlled answerability and balanced coverage of memory skills: single- and multi-hop recall, induction, temporal, spatial, logical, and adversarial. We evaluate memory agents built on strong LM/VLM backbones, using in-context prompting as a baseline. Across 15 text games and multiple visual seeds, results are far from saturated: induction and spatial reasoning are persistent bottlenecks, especially in the visual setting. Persistent memory yields clear gains for open backbones on text games, but improvements are less consistent for VLM agents, suggesting that visually grounded episodic memory remains an open challenge. A human study further confirms the difficulty of EMemBench.