Reinforcement Learning (RL) has shown great potential for refining robotic manipulation policies, yet its efficacy remains bottlenecked by the difficulty of designing generalizable reward functions. In this paper, we propose a framework for online policy refinement that adapts foundation vision-language models (VLMs) into online reward generators. We develop a robust, scalable reward model based on a state-of-the-art VLM, trained on a large-scale, multi-source dataset encompassing real-world robot trajectories, human-object interactions, and diverse simulated environments. Unlike prior approaches that evaluate entire trajectories post hoc, our method uses the VLM to formulate a multifaceted reward signal, comprising process, completion, and temporal contrastive rewards, from current visual observations. Starting from a base policy trained via Imitation Learning (IL), we use these VLM rewards to guide the policy to correct sub-optimal behaviors in a closed-loop manner. We evaluate our framework on challenging long-horizon manipulation benchmarks requiring sequential execution and precise control. Crucially, our reward model operates in a purely zero-shot manner in these test environments. Experimental results show that our method significantly improves the success rate of the initial IL policy within just 30 RL iterations, demonstrating remarkable sample efficiency. This empirical evidence indicates that VLM-generated signals can provide reliable feedback for resolving execution errors, effectively eliminating the need for manual reward engineering and facilitating efficient online refinement for robot learning.
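To make the idea of a multifaceted per-step reward concrete, the following is a minimal sketch of how process, completion, and temporal contrastive terms might be combined into one scalar. The abstract does not specify the combination rule, so everything here is an assumption made for illustration: the weighted sum, the names `RewardWeights` and `combine_rewards`, the completion threshold, and the use of a progress difference between consecutive observations as the temporal term. The VLM itself is not called; its outputs are passed in as plain numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the abstract gives no values.
    process: float = 1.0
    completion: float = 1.0
    temporal: float = 1.0


def combine_rewards(
    progress_prev: float,
    progress_curr: float,
    done_prob: float,
    weights: RewardWeights = RewardWeights(),
    done_threshold: float = 0.5,
) -> float:
    """Combine three VLM-derived signals into one scalar reward.

    All inputs are assumed to be scores in [0, 1] that a reward model
    would produce from visual observations.
    """
    # Process term: how far along the task the current frame appears to be.
    process_r = progress_curr
    # Completion term: sparse bonus once the task is judged finished.
    completion_r = 1.0 if done_prob >= done_threshold else 0.0
    # Temporal term (assumed form): change in progress between two frames.
    temporal_r = progress_curr - progress_prev
    return (
        weights.process * process_r
        + weights.completion * completion_r
        + weights.temporal * temporal_r
    )


if __name__ == "__main__":
    # Progress rose from 0.25 to 0.5 and the task is not yet complete.
    print(combine_rewards(0.25, 0.5, done_prob=0.0))
```

In an online refinement loop, a scalar of this kind would be computed at every environment step and passed to the RL update in place of a hand-designed reward. The actual formulation in the paper may differ from this sketch.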