Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.
翻译:大型语言模型(LLM)智能体在众多基准测试中展现出强劲性能,但多数评估假设环境是静态的。然而,现实部署本质上是动态的,要求智能体持续将其知识、技能和行为与变化的环境及更新的任务条件对齐。为弥补这一差距,我们提出了EvoArena,一个将环境变化建模为终端、软件和社交领域渐进更新序列的基准套件。我们进一步提出了EvoMem,一种基于补丁的记忆范式,将记忆演化记录为结构化的更新历史,使智能体能够通过记忆变化推断环境演化。实验表明,当前智能体在EvoArena上表现不佳,在演化的终端、软件和社交偏好领域平均准确率仅为39.6%。EvoMem持续提升性能,在EvoArena上平均增益1.5%,并在GAIA和LoCoMo等标准基准上分别提升6.1%和4.8%。除单个任务外,EvoMem在EvoArena上进一步将链条级准确率提升3.7%,其中成功需完成一连串相关演化子任务。机制分析表明,EvoMem改善了记忆中的证据捕获,表明能更完整地保留演化的环境状态。我们的结果强调了在评估和记忆中对演化建模对智能体可靠部署的重要性。