We present a theoretical study of continual and experiential learning in large language model agents that combine episodic memory with reinforcement learning. We argue that the key mechanism for continual adaptation, without updating model parameters, is reflection: the agent's ability to use past experience to guide future actions. Empirical findings suggest that episodic, experience-driven reflection enables generalised adaptation across a wide range of open-ended, long-horizon tasks. This indicates that efficient learning can occur during deployment and weakens the traditional separation between training and testing. Motivated by this, we introduce the Stateful Reflective Decision Process, a formal model of reflective memory dynamics. In this abstraction, an agent maintains an episodic memory and performs two core operations. Writing stores interaction outcomes and plays the role of policy evaluation. Reading retrieves relevant past cases to inform decisions and plays the role of policy improvement. This perspective treats reflective memory as a control object that can be analysed using classical reinforcement learning tools. We then develop a read-write reflective learning framework by integrating retrieval into soft policy iteration and establish convergence guarantees. We show that as memory grows and provides denser coverage of the state space, the resulting composite policy converges to the optimal solution. Overall, this framework connects practical memory-based methods with principled reinforcement learning, providing a rigorous mathematical basis for building reflective, memory-embedded agents capable of continual general-purpose learning.
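The read-write loop described above can be sketched in miniature. The snippet below is an illustrative toy, not the paper's formalism: `ReflectiveAgent`, `write`, and `read` are hypothetical names, the environment is a trivial one-state bandit standing in for a long-horizon task, and retrieval is reduced to exact-key lookup. It shows the core idea that writing interaction outcomes acts as policy evaluation, while reading memory to form a soft (Boltzmann) policy acts as policy improvement, with no parameter updates anywhere.

```python
import math
import random
from collections import defaultdict

class ReflectiveAgent:
    """Minimal sketch of a memory-based reflective agent (hypothetical API)."""

    def __init__(self, actions, temperature=1.0):
        self.actions = actions
        self.tau = temperature
        # Episodic memory: (state, action) -> list of observed returns.
        self.memory = defaultdict(list)

    def write(self, state, action, ret):
        """Store an interaction outcome; plays the role of policy evaluation."""
        self.memory[(state, action)].append(ret)

    def read(self, state):
        """Retrieve past cases to score actions; plays the role of policy improvement."""
        q = {}
        for a in self.actions:
            rets = self.memory[(state, a)]
            q[a] = sum(rets) / len(rets) if rets else 0.0
        return q

    def act(self, state):
        """Soft (Boltzmann) policy over memory-derived action values."""
        q = self.read(state)
        weights = [math.exp(q[a] / self.tau) for a in self.actions]
        r = random.random() * sum(weights)
        for a, w in zip(self.actions, weights):
            r -= w
            if r <= 0:
                return a
        return self.actions[-1]

# Toy one-state environment: action 1 is rewarding, action 0 is not.
def reward(state, action):
    return 1.0 if action == 1 else 0.0

random.seed(0)
agent = ReflectiveAgent(actions=[0, 1], temperature=0.2)
for episode in range(200):
    s = 0
    a = agent.act(s)
    agent.write(s, a, reward(s, a))  # denser coverage sharpens the composite policy

q = agent.read(0)  # memory-derived values now favor the rewarding action
```

As memory accumulates, the soft policy concentrates on higher-return actions, mirroring (in a trivial setting) the convergence argument the abstract makes for memory coverage and soft policy iteration.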