Building agentic systems that can autonomously self-improve from experience is a longstanding goal of AI. Large language models (LLMs) today primarily self-improve via two mechanisms: self-reflection for context updates, and reinforcement learning (RL) for weight updates. In this work, we propose Evolutionary System Prompt Learning (E-SPL), a method for jointly improving model contexts and model weights. In each RL iteration, E-SPL selects multiple system prompts and runs rollouts with each in parallel. It applies RL updates to model weights conditioned on each system prompt, and evolutionary updates to the system prompt population via LLM-driven mutation and crossover. Each system prompt has a TrueSkill rating for evolutionary selection, updated from relative performance within each RL iteration batch. E-SPL encourages a natural division between declarative knowledge encoded in prompts and procedural knowledge encoded in weights, resulting in improved performance across reasoning and agentic tasks. For instance, in an easy-to-hard (AIME $\rightarrow$ BeyondAIME) generalization setting, E-SPL improves RL success rate from 38.8% $\rightarrow$ 45.1% while also outperforming reflective prompt evolution (40.0%). Overall, our results show that coupling reinforcement learning with system prompt evolution yields consistent gains in sample efficiency and generalization. Code: https://github.com/LunjunZhang/E-SPL
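The outer loop described above can be sketched in a toy form: select several system prompts, score each with rollouts, update per-prompt ratings from relative performance within the batch, and evolve the population via mutation and crossover. All helper names and the simple mean-centered rating rule below are illustrative stand-ins, not the paper's implementation (which uses LLM-driven operators, TrueSkill ratings, and an RL weight update conditioned on each prompt).

```python
import random

def mutate(prompt: str, rng: random.Random) -> str:
    # Stand-in for LLM-driven mutation of a system prompt.
    return prompt + " Be concise."

def crossover(a: str, b: str, rng: random.Random) -> str:
    # Stand-in for LLM-driven crossover of two system prompts.
    return a[: len(a) // 2] + b[len(b) // 2 :]

def rollout_score(prompt: str, rng: random.Random) -> float:
    # Stand-in for RL rollouts conditioned on `prompt`; in E-SPL the same
    # batch would also drive the RL weight update (omitted here).
    return rng.random()

def espl_iteration(population, ratings, rng, k=4):
    # 1) Select k system prompts for this RL iteration (here: highest-rated).
    selected = sorted(population, key=lambda p: ratings[p], reverse=True)[:k]
    # 2) Run rollouts with each selected prompt (in parallel in the paper).
    scores = {p: rollout_score(p, rng) for p in selected}
    # 3) Update ratings from relative performance within the batch
    #    (a crude mean-centered stand-in for a TrueSkill update).
    mean_score = sum(scores.values()) / len(scores)
    for p, s in scores.items():
        ratings[p] += s - mean_score
    # 4) Evolve the population: mutate the batch winner, cross the top two.
    ranked = sorted(selected, key=scores.get, reverse=True)
    children = [mutate(ranked[0], rng), crossover(ranked[0], ranked[1], rng)]
    for child in children:
        population.append(child)
        ratings[child] = ratings[ranked[0]]  # inherit winner's rating as a prior
    return selected, scores

if __name__ == "__main__":
    rng = random.Random(0)
    population = [f"You are a careful solver. Variant {i}." for i in range(6)]
    ratings = {p: 0.0 for p in population}
    for _ in range(3):
        espl_iteration(population, ratings, rng)
    print(len(population))  # grows by 2 children per iteration: 6 + 3*2 = 12
```

This sketch only shows the evolutionary side of the method; in the full algorithm, step 2's rollouts also supply the policy-gradient batch for the weight update, which is what couples prompt evolution with reinforcement learning.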