Vision-Language-Action (VLA) models have become a prominent paradigm for embodied intelligence, yet further performance improvements typically rely on scaling up training data and model size -- an approach that is prohibitively expensive for robotics and fundamentally limited by data collection costs. We address this limitation with $\mathbf{RoVer}$, an embodied test-time scaling framework that uses a $\mathbf{Ro}$bot Process Reward Model (PRM) as a Test-Time $\mathbf{Ver}$ifier to enhance the capabilities of existing VLA models without modifying their architectures or weights. Specifically, RoVer (i) assigns scalar process rewards to evaluate the reliability of candidate actions, and (ii) predicts an action-space direction to guide candidate expansion and refinement. During inference, RoVer generates multiple candidate actions concurrently from the base policy, expands them along PRM-predicted directions, and then scores all candidates with the PRM to select the optimal action for execution. Notably, by caching shared perception features, RoVer amortizes perception cost and can evaluate more candidates under the same test-time computational budget. In essence, our approach transforms available computing resources into better action decisions, realizing the benefits of test-time scaling without extra training overhead. Our contributions are threefold: (1) a general, plug-and-play test-time scaling framework for VLAs; (2) a PRM that jointly provides scalar process rewards and an action-space direction to guide exploration; and (3) an efficient direction-guided sampling strategy that leverages a shared perception cache to enable scalable candidate generation and selection during inference.
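The inference loop described above (sample candidates, expand along PRM-predicted directions, score, select) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`policy_sample`, `prm_score`, `prm_direction`), the additive expansion rule, and the `step_size` parameter are all assumptions for exposition; only the overall sample-expand-score-select structure and the shared perception cache come from the text.

```python
import numpy as np


def rover_select_action(policy_sample, prm_score, prm_direction, obs_features,
                        num_candidates=8, step_size=0.1):
    """Hypothetical sketch of RoVer's direction-guided test-time selection.

    policy_sample(obs) -> action vector sampled from the base VLA policy
    prm_score(obs, a)  -> scalar process reward for candidate action a
    prm_direction(obs, a) -> action-space direction predicted by the PRM

    obs_features is computed once and reused for every candidate,
    mimicking the shared perception cache that amortizes perception cost.
    """
    # Sample candidate actions from the frozen base policy.
    candidates = [policy_sample(obs_features) for _ in range(num_candidates)]
    # Expand each candidate along the PRM-predicted action-space direction.
    expanded = [a + step_size * prm_direction(obs_features, a)
                for a in candidates]
    # Score the full pool with the PRM and execute the best-scoring action.
    pool = candidates + expanded
    scores = [prm_score(obs_features, a) for a in pool]
    return pool[int(np.argmax(scores))]
```

With toy stand-ins for the policy and PRM (a fixed "good" action as the reward peak), the selector picks the expanded candidate that the direction head steers toward that peak, showing how expansion can beat raw samples under the same scoring budget.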