RoVer: Robot Reward Model as Test-Time Verifier for Vision-Language-Action Model

Vision-Language-Action (VLA) models have become a prominent paradigm for embodied intelligence, yet further performance gains typically rely on scaling up training data and model size, an approach that is prohibitively expensive for robotics and fundamentally limited by data collection costs. We address this limitation with $\mathbf{RoVer}$, an embodied test-time scaling framework that uses a $\mathbf{Ro}$bot Process Reward Model (PRM) as a Test-Time $\mathbf{Ver}$ifier to enhance existing VLA models without modifying their architectures or weights. Specifically, RoVer (i) assigns scalar process rewards that evaluate the reliability of candidate actions, and (ii) predicts an action-space direction for candidate expansion and refinement. During inference, RoVer samples multiple candidate actions concurrently from the base policy, expands them along the PRM-predicted directions, and then scores all candidates with the PRM to select the highest-scoring action for execution. Notably, by caching shared perception features, RoVer amortizes perception cost and can evaluate more candidates under the same test-time compute budget. In effect, our approach converts available compute into better action decisions, realizing the benefits of test-time scaling without extra training overhead. Our contributions are threefold: (1) a general, plug-and-play test-time scaling framework for VLAs; (2) a PRM that jointly provides scalar process rewards and an action-space direction to guide exploration; and (3) an efficient direction-guided sampling strategy that leverages a shared perception cache to enable scalable candidate generation and selection during inference.
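The inference procedure described in the abstract (sample candidates, expand them along predicted directions, score everything, pick the best) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `encode_observation`, `toy_policy_sample`, `toy_prm`, `select_action`, the step size, and the candidate count are all hypothetical stand-ins, and the toy PRM simply rewards actions that lie close to the cached feature vector. The sketch only shows the control flow and how one cached perception pass can be reused across all candidates.

```python
import numpy as np

def encode_observation(obs):
    """Hypothetical perception encoder; run once per step, then reused."""
    return np.tanh(obs)

def toy_policy_sample(features, n, rng):
    """Stand-in for the frozen base policy: draws n noisy candidate actions."""
    return features + rng.normal(scale=0.5, size=(n, features.shape[0]))

def toy_prm(features, actions):
    """Stand-in PRM returning (scalar rewards, unit directions).

    Toy assumption: the ideal action equals the feature vector, so the
    reward is the negative distance to it and the direction points at it.
    """
    delta = features - actions
    dist = np.linalg.norm(delta, axis=1, keepdims=True)
    rewards = -dist[:, 0]
    directions = delta / np.maximum(dist, 1e-8)
    return rewards, directions

def select_action(obs, n_candidates=8, step_size=0.1, seed=0):
    rng = np.random.default_rng(seed)
    features = encode_observation(obs)           # shared perception cache
    candidates = toy_policy_sample(features, n_candidates, rng)
    _, directions = toy_prm(features, candidates)
    expanded = candidates + step_size * directions  # direction-guided expansion
    pool = np.concatenate([candidates, expanded], axis=0)
    rewards, _ = toy_prm(features, pool)         # score original + expanded
    best = int(np.argmax(rewards))
    return pool[best], rewards[best], pool

obs = np.array([0.2, -0.4, 0.1])
action, reward, pool = select_action(obs)
print(action.shape, pool.shape)
```

Because the pool always contains the original samples, the selected action in this sketch scores at least as well under the toy PRM as the best unexpanded candidate; whether that carries over to a learned PRM depends on how well its rewards track task success.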

