Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment

The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling laws for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling either dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields gains of 22% in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves gains of 14% in task progress and 9% in success rate.
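The deployment procedure described above (rephrase the instruction, sample several actions per rephrasing, let a verifier pick) can be illustrated with a minimal sketch. This is not the CoVer implementation: the function name select_with_verifier, the toy policy, and the toy scoring function are all hypothetical stand-ins, and the sketch collapses the hierarchical prompt-then-action selection into a single argmax over the joint grid of rephrasings and action samples. In a real system the score would presumably come from a learned verifier conditioned on the camera image and the original instruction.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Action = List[float]


def select_with_verifier(
    rephrasings: Iterable[str],
    policy: Callable[[str, int], Action],
    score: Callable[[Action], float],
    num_actions: int,
) -> Tuple[str, Action, float]:
    """Best-of-N selection over (rephrasing, action sample) pairs.

    `policy(prompt, k)` returns the k-th action sample for a prompt and
    `score(action)` returns a higher-is-better alignment score. Both are
    hypothetical interfaces assumed for this sketch.
    """
    best: Optional[Tuple[str, Action, float]] = None
    for prompt in rephrasings:
        for k in range(num_actions):
            action = policy(prompt, k)
            s = score(action)
            # Strict comparison: on ties, the earliest candidate is kept.
            if best is None or s > best[2]:
                best = (prompt, action, s)
    if best is None:
        raise ValueError("need at least one rephrasing and one action sample")
    return best


# Toy stand-ins (assumptions for illustration only).
def toy_policy(prompt: str, k: int) -> Action:
    # Pretend the action depends on the prompt and the sample index.
    return [float(len(prompt)) + k]


def toy_score(action: Action) -> float:
    # Pretend the "correct" action is 7.0; closer means better aligned.
    return 1.0 / (1.0 + abs(action[0] - 7.0))


if __name__ == "__main__":
    prompts = ["pick", "grasp", "lift it"]  # stands in for VLM rephrasings
    print(select_with_verifier(prompts, toy_policy, toy_score, num_actions=3))
```

Because the rephrasings do not depend on the current observation, they can be generated once ahead of time, which is one reading of what "boot-time compute" refers to; only the action sampling and scoring loop would then need to run at each control step.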
