Training-free verbal reinforcement learning enables LLM agents to learn from world feedback -- objective signals such as dynamic task outcomes, market returns, or demand forecasts -- by extracting verbal rules from experience and injecting them as context, updating the agent's behavior without parameter changes. However, in non-stationary environments these agents face a retention-forgetting dilemma: retaining stale insights causes negative transfer, while discarding them causes catastrophic forgetting when conditions recur. We identify four requirements for navigating this dilemma -- outcome-driven evaluation, persistent structured evidence, a non-monotonic knowledge lifecycle, and compositional governance -- and show that existing methods invest heavily in experience extraction while underinvesting in insight governance. We propose a three-layer architecture -- rules, evidence, and skills -- connected by a feedback-driven curation loop that closes this governance gap. Rules capture experience distilled from world outcomes; evidence logs track each rule's reliability across episodes; skills govern which rules to apply, how to resolve conflicts, and when to abstain. Using financial forecasting as a case study, where world feedback is naturally abundant, noisy, and non-stationary, we show that the same accumulated experience either degrades performance below the zero-shot baseline or substantially improves accuracy and risk-adjusted returns, depending on whether the curation loop is present.
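The three-layer loop described above (rules, evidence, skills) can be illustrated with a minimal sketch. Everything here is a hypothetical illustration rather than the authors' implementation: the names `Rule`, `curate`, and `select_rules`, the `"active"`/`"dormant"` statuses, the windowed hit rate used as a reliability score, and all thresholds are assumptions chosen for clarity. The sketch also assumes that dormant rules continue to be scored against outcomes, so that they can be revived when conditions recur.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    """Rules layer: a verbal rule distilled from world outcomes (hypothetical)."""
    text: str
    status: str = "active"  # "active" or "dormant"; rules are never deleted
    # Evidence layer: per-episode record of whether the rule agreed with the outcome.
    outcomes: List[bool] = field(default_factory=list)

    def reliability(self, window: int = 5) -> float:
        """Hit rate over the last `window` episodes (0.5 when there is no evidence)."""
        recent = self.outcomes[-window:]
        return sum(recent) / len(recent) if recent else 0.5


def curate(rule: Rule, helped: bool, retire_below: float = 0.4,
           revive_above: float = 0.6, min_evidence: int = 3) -> None:
    """Curation loop: log the outcome, then retire or revive the rule.

    The lifecycle is non-monotonic: a stale rule becomes dormant instead of
    being discarded, and returns to active use if its recent evidence recovers.
    Thresholds are illustrative assumptions.
    """
    rule.outcomes.append(helped)
    if len(rule.outcomes) < min_evidence:
        return  # too little evidence to change status
    score = rule.reliability()
    if rule.status == "active" and score < retire_below:
        rule.status = "dormant"
    elif rule.status == "dormant" and score > revive_above:
        rule.status = "active"


def select_rules(rules: List[Rule], min_reliability: float = 0.5) -> List[Rule]:
    """Skills layer: choose which rules to inject as context.

    An empty result means the agent abstains from using learned rules.
    """
    return [r for r in rules
            if r.status == "active" and r.reliability() >= min_reliability]
```

In this sketch, a rule that fails repeatedly stops being injected but keeps its evidence log, which is one simple way to avoid both negative transfer from stale rules and forgetting when a regime returns. Conflict resolution between rules, which the skills layer is also said to handle, is omitted here.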