Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments, where agents must explore actively, learn from limited feedback, and decide when to trust prior experience. Existing LLM-agent methods often rely on memory or planning modules, yet they rarely close the loop between the two to continually refine an internal understanding of environment dynamics. We introduce ProPlay, a procedural world model that supports procedure-level preplay, in which agents rehearse future procedural paths using learned world knowledge. Rather than representing experience as isolated rules or low-level action constraints, ProPlay abstracts successful trajectories into procedures and organizes them in a procedure graph that captures causal transitions among task stages. Each transition is associated with a reliability record embedding that estimates its task-specific contribution from past outcomes. Before each episode, ProPlay simulates future procedural trajectories over the known graph structure and provides them as structured soft guidance; after execution, it refines the graph using environment feedback. Experiments on public benchmarks show that ProPlay consistently improves environment understanding and self-evolution capability over strong baselines. Our code is available at https://github.com/antman9914/proplay.
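To make the preplay-then-refine loop concrete, the following is a minimal illustrative sketch and not the released implementation. It assumes procedures can be named by plain strings, and it swaps the reliability record embedding for a simple smoothed success count per transition. The names `ProcedureGraph`, `Reliability`, `preplay`, and `update` are hypothetical and chosen only for this example.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Reliability:
    """Hypothetical stand-in for a reliability record: outcome counts per transition."""
    successes: int = 0
    attempts: int = 0

    def score(self) -> float:
        # Laplace-smoothed success rate (an assumption of this sketch).
        return (self.successes + 1) / (self.attempts + 2)


class ProcedureGraph:
    """Directed graph whose nodes are procedures and whose edges are transitions."""

    def __init__(self) -> None:
        self.edges: dict[str, dict[str, Reliability]] = defaultdict(dict)

    def update(self, path: list[str], success: bool) -> None:
        """Refine the graph after an episode using the observed outcome."""
        for src, dst in zip(path, path[1:]):
            record = self.edges[src].setdefault(dst, Reliability())
            record.attempts += 1
            record.successes += int(success)

    def preplay(self, start: str, goal: str, max_steps: int = 10) -> list[str]:
        """Greedily rehearse a procedural path over known transitions only."""
        path, visited, current = [start], {start}, start
        for _ in range(max_steps):
            if current == goal:
                break
            # Consider only known, not-yet-visited successors of the current procedure.
            candidates = {
                dst: rec
                for dst, rec in self.edges.get(current, {}).items()
                if dst not in visited
            }
            if not candidates:
                break  # no known transition: the agent would have to explore
            current = max(candidates, key=lambda dst: candidates[dst].score())
            path.append(current)
            visited.add(current)
        return path


# Usage: record two episodes, then rehearse a path before the next one.
graph = ProcedureGraph()
graph.update(["find_key", "open_door", "exit"], success=True)
graph.update(["find_key", "break_window", "exit"], success=False)
suggested = graph.preplay("find_key", "exit")
```

In this sketch the rehearsed path is returned as a suggestion, which the agent is free to deviate from, in the spirit of the soft guidance described above. The greedy rollout is a simplification made for brevity; the abstract does not specify how trajectories are simulated.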