Vision-Language-Action (VLA) models, as large foundation models for embodied control, have shown strong performance in manipulation tasks. However, this performance comes at a high inference cost. To improve efficiency, recent methods adopt action chunking, which predicts a sequence of future actions for open-loop execution. Although effective for reducing computation, open-loop execution is sensitive to environmental changes and prone to error accumulation because it lacks closed-loop feedback. To address this limitation, we propose Speculative Verification for VLA Control (SV-VLA), a framework that combines efficient open-loop long-horizon planning with lightweight closed-loop online verification. Specifically, SV-VLA uses a heavy VLA as a low-frequency macro-planner to generate an action chunk together with a planning context, while a lightweight verifier continuously monitors execution based on the latest observations. Conditioned on both the current observation and the planning context, the verifier compares the planned action against a closed-loop reference action and triggers replanning only when necessary. Experiments demonstrate that SV-VLA combines the efficiency of chunked prediction with the robustness of closed-loop control, enabling efficient and reliable VLA-based control in dynamic environments. Code is available at https://github.com/edsad122/SV-VLA.
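
The control loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`run_with_verification`, `plan_chunk`, `reference_action`, `step_env`), the Euclidean deviation test, the fixed `threshold`, and the rule that the first action of a fresh chunk is executed without a check are all assumptions made for illustration. The 1-D toy environment, planner, and verifier stand in for the heavy VLA and the lightweight verifier.

```python
from typing import Callable, List, Sequence, Tuple

Action = Sequence[float]


def deviation(a: Action, b: Action) -> float:
    """Euclidean distance between two actions (an assumed disagreement measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def run_with_verification(
    plan_chunk: Callable,        # hypothetical heavy planner: obs -> (chunk, context)
    reference_action: Callable,  # hypothetical light verifier: (obs, context) -> action
    step_env: Callable,          # environment step: action -> (obs, done)
    obs,
    threshold: float = 0.5,      # assumed fixed replanning threshold
    max_steps: int = 100,
) -> Tuple[int, int]:
    """Execute chunks open-loop, replanning when the verifier disagrees.

    Returns (executed_steps, planner_calls).
    """
    steps = planner_calls = 0
    while steps < max_steps:
        chunk, context = plan_chunk(obs)
        planner_calls += 1
        if not chunk:
            raise ValueError("planner returned an empty chunk")
        for i, planned in enumerate(chunk):
            # Assumption: the first action of a fresh chunk is always executed,
            # which guarantees progress between consecutive planner calls.
            if i > 0 and deviation(planned, reference_action(obs, context)) > threshold:
                break  # verifier disagrees: discard the rest of the chunk and replan
            obs, done = step_env(planned)
            steps += 1
            if done or steps >= max_steps:
                return steps, planner_calls
    return steps, planner_calls


def clamp(v: float) -> float:
    return max(-1.0, min(1.0, v))


class ToyEnv:
    """1-D point that must reach a target; the target may jump once."""

    def __init__(self, target: float, shift_at: int = -1, new_target: float = 0.0):
        self.pos, self.target, self.t = 0.0, target, 0
        self.shift_at, self.new_target = shift_at, new_target

    def obs(self) -> Tuple[float, float]:
        return (self.pos, self.target)

    def step(self, action: Action):
        self.pos += action[0]
        self.t += 1
        if self.t == self.shift_at:  # simulated environmental change
            self.target = self.new_target
        return self.obs(), abs(self.pos - self.target) < 1e-9


def toy_planner(obs, horizon: int = 4) -> Tuple[List[Action], dict]:
    """Plans `horizon` unit steps toward the target seen at planning time."""
    pos, target = obs
    chunk: List[Action] = []
    for _ in range(horizon):
        a = clamp(target - pos)
        chunk.append((a,))
        pos += a
    return chunk, {"target_at_plan_time": target}


def toy_reference(obs, context) -> Action:
    """Closed-loop action from the latest observation (context unused in this toy)."""
    pos, target = obs
    return (clamp(target - pos),)
```

In this toy setup, a static target at 4.0 is reached in 4 steps with a single planner call. If the target jumps to -2.0 after the second step, the verifier rejects the third planned action, the planner is called a second time, and the episode finishes in 6 steps. A plain open-loop executor would instead have completed the stale chunk before noticing the change.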