VeriSpace: Spatially Grounded Action Verification for Vision-Language-Action Models

Vision-language-action (VLA) models have shown strong promise for robotic manipulation, but their reliability at test time remains limited by one-shot action prediction, where even small action errors can cause grasp failure, collision, or incorrect task progression. A natural alternative is to equip VLA systems with test-time verification, allowing multiple candidate actions to be proposed and evaluated before execution. However, reliable action verification is challenging because it requires not only distinguishing subtle geometric differences between candidate actions, but also assessing whether an action makes meaningful progress toward the task goal. We present VeriSpace, a 3D-aware action verifier for test-time action selection in VLA systems. VeriSpace evaluates candidate actions through two key components: Dual-Path 3D-Injected Scene Encoding, which constructs a scene representation that jointly preserves visual semantics and explicit 3D geometry, and Spatially-Grounded Action Reasoning, which evaluates each action by reasoning over task-relevant spatial relations, geometric validity, and expected goal progress. Together, these components enable more reliable discrimination between subtle yet outcome-critical action candidates while remaining fully compatible with existing VLA policies. Experiments on public benchmarks and real-world robotic manipulation tasks show that VeriSpace consistently improves decision reliability over both underlying VLA policies and prior verification-based methods, yielding substantial gains in both in-distribution and out-of-distribution settings.
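The propose-then-verify loop described above can be illustrated with a minimal best-of-N sketch. This is not the VeriSpace implementation: `select_action`, `toy_policy`, and `toy_verifier` are hypothetical names. The toy verifier scores an action only by its distance to a known goal offset, standing in for the learned 3D-aware verifier, whose architecture and scoring function are not specified here.

```python
import math
import random
from typing import Callable, List, Tuple

# Hypothetical action type: an end-effector displacement (dx, dy, dz).
Action = Tuple[float, ...]


def select_action(
    propose: Callable[[], Action],
    score: Callable[[Action], float],
    num_candidates: int = 8,
) -> Tuple[Action, float]:
    """Draw candidates from a policy and return the one the verifier ranks highest."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates: List[Action] = [propose() for _ in range(num_candidates)]
    scored = [(score(action), action) for action in candidates]
    # Compare on the score only, so ties keep the earliest candidate.
    best_score, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_score


def toy_policy(goal: Action, noise: float, rng: random.Random) -> Callable[[], Action]:
    """Placeholder for a VLA policy: samples noisy actions around a goal offset."""
    return lambda: tuple(g + rng.gauss(0.0, noise) for g in goal)


def toy_verifier(goal: Action) -> Callable[[Action], float]:
    """Placeholder verifier (assumption): higher score means closer to the goal."""
    return lambda action: -math.dist(action, goal)


if __name__ == "__main__":
    rng = random.Random(0)
    goal = (0.10, 0.00, 0.05)
    action, verifier_score = select_action(
        toy_policy(goal, noise=0.02, rng=rng),
        toy_verifier(goal),
        num_candidates=8,
    )
    print("selected action:", action)
    print("verifier score:", verifier_score)
```

Because the selector only calls `propose` and `score`, any policy that can be sampled repeatedly can be plugged in without modification, which is one way to read the compatibility claim above.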
