Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment

The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling laws for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce CoVer-VLA, a hierarchical test-time verification pipeline built on the trained verifier. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language Model (VLM), repeatedly generates action candidates for each instruction, and then uses the verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields gains of 22% in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer-VLA achieves gains of 14% in task progress and 9% in success rate.
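
The deployment loop described above (several rephrased instructions, several action candidates per instruction, verifier-based selection at both levels) can be illustrated with a minimal Python sketch. This is not the CoVer-VLA implementation: the function names, the cosine score standing in for the learned contrastive verifier, the toy policy, and the rule of choosing the prompt by mean candidate score are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_score(a: Vector, b: Vector) -> float:
    """Cosine similarity; a stand-in (assumption) for a learned verifier score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def hierarchical_select(
    context: Vector,
    prompts: List[str],
    sample_actions: Callable[[str, int], List[Vector]],
    n_actions: int,
    score: Callable[[Vector, Vector], float] = cosine_score,
) -> Tuple[str, Vector, float]:
    """Pick the prompt with the highest mean score, then its best action.

    `context` is a hypothetical embedding of the original instruction and
    observation; `sample_actions` is a hypothetical policy interface.
    """
    if not prompts or n_actions < 1:
        raise ValueError("need at least one prompt and one action sample")
    ranked = []
    for prompt in prompts:
        candidates = sample_actions(prompt, n_actions)
        if not candidates:
            continue  # a prompt that yields no candidates cannot be selected
        scores = [score(context, action) for action in candidates]
        ranked.append((sum(scores) / len(scores), prompt, candidates, scores))
    if not ranked:
        raise ValueError("the policy returned no action candidates")
    # High level: choose the rephrased instruction with the best mean score.
    _, best_prompt, candidates, scores = max(ranked, key=lambda item: item[0])
    # Low level: choose the best-scoring action generated under that prompt.
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_prompt, candidates[best_index], scores[best_index]


def toy_policy(prompt: str, n: int) -> List[Vector]:
    """Deterministic toy policy (assumption): fixed candidates per prompt."""
    table = {
        "pick up the red block": [(1.0, 0.1), (0.9, 0.3)],
        "grab the red cube": [(0.2, 1.0), (1.0, 0.0)],
    }
    return table[prompt][:n]


best_prompt, best_action, best_score = hierarchical_select(
    context=(1.0, 0.0),
    prompts=["pick up the red block", "grab the red cube"],
    sample_actions=toy_policy,
    n_actions=2,
)
print(best_prompt, best_action, round(best_score, 3))
```

In this toy setup the second prompt produces the single best-scoring action, but also a poor one, so the two-stage rule prefers the first prompt, whose candidates are consistently well aligned. A flat argmax over all prompt-action pairs would be an equally valid alternative; the abstract does not specify which aggregation the authors use.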
