The hallmark of human intelligence is the ability to master new skills through Constructive Episodic Simulation: retrieving past experiences to synthesize solutions for novel tasks. While Large Language Models (LLMs) possess strong reasoning capabilities, they struggle to emulate this self-evolution: fine-tuning is computationally expensive and prone to catastrophic forgetting, while existing memory-based methods rely on passive semantic matching that often retrieves noise. To address these challenges, we propose MemRL, a framework that enables agents to self-evolve via non-parametric reinforcement learning over episodic memory. MemRL explicitly separates the stable reasoning of a frozen LLM from a plastic, evolving memory. Unlike prior methods, MemRL employs a Two-Phase Retrieval mechanism that first filters candidates by semantic relevance and then selects among them based on learned Q-values (utilities). These utilities are continuously refined through environmental feedback in a trial-and-error manner, allowing the agent to distinguish high-value strategies from superficially similar noise. Extensive experiments on HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench demonstrate that MemRL significantly outperforms state-of-the-art baselines. Our analyses further confirm that MemRL effectively reconciles the stability-plasticity dilemma, enabling continuous runtime improvement without weight updates.
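The Two-Phase Retrieval idea described above can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: `MemoryEntry`, `two_phase_retrieve`, and `update_utilities` are hypothetical names, cosine similarity stands in for whatever semantic matcher MemRL actually uses, and the utility update is a simple bandit-style rule rather than the paper's exact Q-learning objective.

```python
import math
import random
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    """One episodic memory: a past experience plus a learned utility (Q-value)."""
    text: str
    embedding: list[float]
    q_value: float = 0.0

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def two_phase_retrieve(query_emb, memory, k_semantic=5, k_final=2, epsilon=0.1):
    # Phase 1: filter candidates by semantic relevance to the query.
    candidates = sorted(
        memory, key=lambda m: cosine(query_emb, m.embedding), reverse=True
    )[:k_semantic]
    # Phase 2: select among the semantically relevant candidates by learned
    # utility, with epsilon-greedy exploration so low-Q entries still get tried.
    if random.random() < epsilon:
        return random.sample(candidates, min(k_final, len(candidates)))
    return sorted(candidates, key=lambda m: m.q_value, reverse=True)[:k_final]

def update_utilities(selected, reward, alpha=0.3):
    # Trial-and-error refinement: nudge each used memory's utility toward the
    # environmental reward, so high-value strategies separate from similar noise.
    for m in selected:
        m.q_value += alpha * (reward - m.q_value)
```

Because the LLM stays frozen, all plasticity lives in the `q_value` fields: repeated environmental feedback raises the utility of genuinely helpful memories even when a semantically similar but useless entry scores just as well in Phase 1.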