Reinforcement learning for language agents increasingly depends on custom harnesses that manage long-running context, multi-turn tool use, and multi-agent orchestration. However, porting these harnesses to RL environment interfaces remains difficult and often loses important training signals. We bridge this gap with Polar, a rollout framework for scalable asynchronous RL over arbitrary agent harnesses. Polar treats the agent harness as a black box: it proxies LLM API calls, records token-level model interactions, and reconstructs token-faithful trajectories for training. Each rollout node manages runtime prewarming, agent execution, trajectory reconstruction, and evaluation in parallel, and exposes asynchronous service endpoints that independent trainers can consume at scale. This decoupled design makes Polar agnostic to agent harnesses, training infrastructure, and RL algorithms while improving compute utilization for long-running agent workloads. We validate Polar by training agents on software-engineering tasks with popular coding harnesses. Using simple GRPO, Polar improves Qwen3.5-4B by 22.6, 4.8, 0.6, and 6.2 points on SWE-Bench Verified with the Codex, Claude Code, Qwen Code, and Pi harnesses, respectively. We further demonstrate Polar for offline data generation over custom harnesses and ablate trajectory reconstruction strategies. Polar is a rewrite of its predecessor, ProRL Agent, and is registered as a NeMo Gym environment.
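The black-box recording and reconstruction idea described in the abstract can be illustrated with a small sketch. This is not Polar's implementation: the names `RecordedCall`, `Trajectory`, and `reconstruct` are hypothetical, and the prefix-matching merge rule is only one plausible reconstruction strategy, assumed here for illustration. The sketch assumes the proxy has logged, for each LLM call, the exact prompt token ids and the sampled completion token ids.

```python
from dataclasses import dataclass, field


@dataclass
class RecordedCall:
    """One proxied LLM call (hypothetical record format)."""
    prompt_ids: list[int]       # token ids the model was conditioned on
    completion_ids: list[int]   # token ids the model sampled


@dataclass
class Trajectory:
    token_ids: list[int] = field(default_factory=list)
    loss_mask: list[int] = field(default_factory=list)  # 1 = sampled by the model


def reconstruct(calls: list[RecordedCall]) -> list[Trajectory]:
    """Merge recorded calls into trajectories by token-prefix matching.

    Assumed rule: a call extends the current trajectory only if its prompt
    starts with every token accumulated so far. Otherwise the harness changed
    the context (for example, compaction or a sub-agent), so a new trajectory
    is started instead of re-tokenizing or guessing.
    """
    trajectories: list[Trajectory] = []
    for call in calls:
        current = trajectories[-1] if trajectories else None
        seen = len(current.token_ids) if current else 0
        if current is not None and call.prompt_ids[:seen] == current.token_ids:
            # Tokens added since the last call came from the harness
            # (tool output, user turn), so they are masked out of the loss.
            new_context = call.prompt_ids[seen:]
        else:
            current = Trajectory()
            trajectories.append(current)
            new_context = call.prompt_ids
        current.token_ids += new_context
        current.loss_mask += [0] * len(new_context)
        current.token_ids += call.completion_ids
        current.loss_mask += [1] * len(call.completion_ids)
    return trajectories
```

Because the sketch reuses the recorded token ids rather than re-tokenizing decoded text, the training sequence matches what the model actually saw and sampled, which is the property the abstract calls token-faithful. A trainer would then apply its policy-gradient loss only where `loss_mask` is 1.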