While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation because their weights are frozen after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and risks catastrophic forgetting. We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on the fly; these estimates are then used to directly modulate the LLM's output logits. We theoretically prove that this additive update rule is the exact closed-form solution to the KL-constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state of the art among training-free methods. Crucially, JitRL outperforms computationally expensive fine-tuning methods (e.g., WebRL) while reducing monetary costs by over 30x, offering a scalable path toward continually learning agents. The code is available at https://github.com/liushiliushi/JitRL.
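The additive logit update described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the advantage values, the temperature `beta`, and the toy action vocabulary are all assumptions; in JitRL the advantages would come from trajectories retrieved from the experience memory. Under a KL-penalty strength `beta`, the closed-form solution to maximizing expected advantage subject to a KL constraint against the base policy is `pi*(a) ∝ pi_base(a) * exp(A(a) / beta)`, which amounts to adding `A(a) / beta` to the base logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Base LLM logits over a small action set (illustrative values).
base_logits = [2.0, 1.0, 0.5]

# Hypothetical retrieval-based advantage estimates A(s, a) per action.
advantages = [-1.0, 2.0, 0.0]

beta = 1.0  # assumed KL-penalty strength (hyperparameter)

# Additive update: new_logit(a) = logit(a) + A(a) / beta.
# Applying softmax to these logits yields exactly
# pi*(a) ∝ pi_base(a) * exp(A(a) / beta), the KL-constrained optimum.
new_logits = [l + a / beta for l, a in zip(base_logits, advantages)]
pi_new = softmax(new_logits)
```

Because no gradients are involved, this adjustment can be applied at decoding time to any frozen model that exposes its output logits.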