While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback needed to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to output progress values directly, an approach prone to numerical misrepresentation, TOPReward extracts task progress from the VLM's internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves a mean Value-Order Correlation (VOC) of 0.947 on Qwen3-VL, dramatically outperforming the state-of-the-art GVL baseline, which achieves near-zero correlation on the same open-source model. We further demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.
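To make the logit-based extraction concrete, here is a minimal sketch, not the paper's implementation: it assumes the VLM is prompted so that its next token names a discretized progress bin, and that progress is read off as an expectation under the softmax restricted to those candidate tokens. The helper name `progress_from_logits`, the bin layout, and the toy token ids are all hypothetical.

```python
import torch

def progress_from_logits(logits: torch.Tensor,
                         bin_token_ids: list[int],
                         bin_values: list[float]) -> float:
    """Expected task progress from a VLM's next-token logits.

    logits:        [vocab_size] next-token logits, conditioned on the task
                   description and the frames observed so far.
    bin_token_ids: vocabulary ids of the candidate progress tokens
                   (hypothetical; e.g. the tokens for "0", "10", ..., "100").
    bin_values:    the progress value in [0, 1] each candidate denotes.
    """
    # Restrict the softmax to the candidate tokens, then take the
    # probability-weighted mean: an expectation over progress bins
    # rather than a single sampled number.
    bin_logits = logits[torch.tensor(bin_token_ids)]
    probs = torch.softmax(bin_logits, dim=-1)
    return float(probs @ torch.tensor(bin_values))

# Toy usage with random logits standing in for a real VLM forward pass.
fake_logits = torch.randn(32_000)          # pretend vocabulary size
bin_ids = list(range(100, 111))            # placeholder vocabulary ids
bin_vals = [i / 10 for i in range(11)]     # 0.0, 0.1, ..., 1.0
print(progress_from_logits(fake_logits, bin_ids, bin_vals))
```

If this reading is right, it illustrates why logit extraction can avoid the failure mode the abstract attributes to direct prompting: a sampled numeric string reflects one draw from the model, whereas the restricted softmax uses the model's full distribution over progress values.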