We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenario. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also find that optimized reward-model signals outperform outcome signals that rely on human labels. Code: https://github.com/Gen-Verse/Open-AgentRL
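To make the "integrated feedback from step-wise and outcome signals" concrete, below is a minimal sketch of one plausible way to blend per-step reward-model scores with a trajectory-level outcome reward into per-step returns. This is not the authors' implementation; the blending weight `alpha`, the discount `gamma`, and the function name are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's method): blend dense step-wise
# reward-model scores with a sparse trajectory-level outcome reward.
from typing import List

def blended_returns(step_scores: List[float],
                    outcome_reward: float,
                    alpha: float = 0.5,
                    gamma: float = 0.99) -> List[float]:
    """Combine step-wise reward-model scores with the final outcome reward.

    alpha weights the dense step-wise signal against the sparse outcome;
    gamma discounts future rewards when accumulating returns.
    """
    T = len(step_scores)
    # Per-step rewards: scaled reward-model score at every step,
    # with the outcome reward added at the final step.
    rewards = [alpha * s for s in step_scores]
    rewards[-1] += (1.0 - alpha) * outcome_reward
    # Standard discounted return, accumulated backwards over the trajectory.
    returns = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a 4-step trajectory with reward-model step scores and a successful outcome.
print(blended_returns([0.2, 0.5, 0.4, 0.8], outcome_reward=1.0))
```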