A well-designed reward is critical for effective reinforcement learning-based policy improvement. In real-world robotic domains, obtaining such rewards typically requires either labor-intensive human labeling or brittle, handcrafted objectives. Vision-language models (VLMs) have shown promise as automatic reward models, yet their effectiveness on real-robot tasks remains poorly understood. In this work, we aim to close this gap by introducing (1) \textbf{RoboReward}, a robotics reward dataset and benchmark built on large-scale real-robot corpora from Open X-Embodiment (OXE) and RoboArena, and (2) vision-language reward models trained on this dataset (RoboReward 4B/8B). Because OXE is success-heavy and lacks failure examples, we propose a \emph{negative-example data augmentation} pipeline that generates calibrated \emph{negatives} and \emph{near-misses} by counterfactually relabeling successful episodes and by temporally clipping the same videos to create partial-progress outcomes. Using this framework, we produce a training and evaluation dataset that spans diverse tasks and embodiments and enables systematic assessment of whether state-of-the-art VLMs can reliably provide rewards for robotics. Our evaluation of leading open-weight and proprietary VLMs reveals that no model excels across all tasks, underscoring substantial room for improvement. We then train general-purpose 4B- and 8B-parameter reward models that outperform much larger VLMs at assigning rewards for short-horizon robotic tasks. Finally, we deploy the 8B reward VLM in real-robot reinforcement learning and find that it improves policy learning by a large margin over Gemini Robotics-ER 1.5, a frontier physical-reasoning VLM trained on robotics data, while substantially narrowing the gap to RL with human-provided rewards.
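To make the augmentation idea concrete, the sketch below illustrates the two operations described above: counterfactually relabeling a successful episode with an instruction it does not satisfy, and clipping the video early to produce a partial-progress near-miss. This is a minimal illustration, not the paper's implementation; the \texttt{Episode} structure, function names, and binary 0/1 reward labels are our own assumptions.

\begin{verbatim}
# Minimal sketch of the negative-example augmentation (illustrative only).
from dataclasses import dataclass, replace
from typing import List
import random

@dataclass
class Episode:
    frames: List[bytes]   # RGB frames of the rollout video
    instruction: str      # language command given to the robot
    reward: float         # 1.0 = success, 0.0 = failure (assumed binary label)

def counterfactual_negative(ep: Episode, instructions: List[str]) -> Episode:
    """Relabel a successful episode with an instruction it does NOT satisfy."""
    candidates = [t for t in instructions if t != ep.instruction]
    if not candidates:
        raise ValueError("need at least one distinct instruction")
    return replace(ep, instruction=random.choice(candidates), reward=0.0)

def near_miss(ep: Episode, keep_fraction: float = 0.7) -> Episode:
    """Clip the rollout before completion to create partial progress.
    Labeled here as failure; a graded partial-credit label is equally plausible."""
    cut = max(1, int(len(ep.frames) * keep_fraction))
    return replace(ep, frames=ep.frames[:cut], reward=0.0)

def augment(successes: List[Episode]) -> List[Episode]:
    """Pair every positive with a counterfactual negative and a near-miss."""
    pool = [ep.instruction for ep in successes]
    out = []
    for ep in successes:
        out.append(ep)                                  # positive
        out.append(counterfactual_negative(ep, pool))   # calibrated negative
        out.append(near_miss(ep))                       # temporal-clip near-miss
    return out
\end{verbatim}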