Vision-Language-Action models (VLAs) have demonstrated strong task understanding and generalization in robotic manipulation, yet the high computational cost of full-model inference limits their deployment in low-latency, high-frequency closed-loop control. We propose an asynchronous semantic-action decoupling framework that separates semantic understanding from action generation along the internal semantic-action interface of existing VLAs, without redesigning the vision-language backbone or introducing an external planner. A low-frequency understanding module asynchronously updates reusable semantic conditions, while a high-frequency action module continuously outputs control actions without repeatedly invoking the full model. To mitigate the temporal mismatch between stale semantics and the current execution state, we further introduce historical action conditioning and time-misalignment training, which provide short-horizon execution context and improve feedback control robustness under stale semantic conditions. Experiments on LIBERO with $\pi_{0.5}$ and UniVLA, together with real-robot deployment using UniVLA, show that the proposed framework achieves up to 35.6 Hz server-side action-module inference throughput and offers a low-intrusion path to high-frequency closed-loop control without running full VLA inference at control rate.
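The two-rate scheduling described above can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the paper's implementation: `understand`, `act`, `SemanticCondition`, the scalar "features", and the parameters `update_period` and `history_len` are all hypothetical stand-ins, and the asynchronous update is simulated synchronously by refreshing the cached condition every few control steps so that the run is deterministic.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class SemanticCondition:
    features: float   # hypothetical stand-in for cached semantic features
    computed_at: int  # control step whose observation produced the features


def understand(observation: float, step: int) -> SemanticCondition:
    """Hypothetical slow module: stands in for the full vision-language pass."""
    return SemanticCondition(features=0.5 * observation, computed_at=step)


def act(observation: float, condition: SemanticCondition, history: deque) -> float:
    """Hypothetical fast module: combines the current observation, the
    (possibly stale) semantic condition, and the recent action history."""
    context = sum(history) / len(history) if history else 0.0
    return condition.features + 0.1 * observation + 0.01 * context


def run(num_steps: int = 12, update_period: int = 4, history_len: int = 3):
    """Run the toy control loop; return (step, staleness, action) per step."""
    history = deque(maxlen=history_len)  # short-horizon execution context
    condition = understand(0.0, 0)       # initial semantic condition
    log = []
    for step in range(num_steps):
        observation = float(step)        # toy observation
        # Low-frequency path: refresh semantics only every `update_period` steps.
        if step > 0 and step % update_period == 0:
            condition = understand(observation, step)
        # High-frequency path: one action per control step, no full-model call.
        action = act(observation, condition, history)
        history.append(action)
        log.append((step, step - condition.computed_at, action))
    return log


if __name__ == "__main__":
    for step, staleness, action in run():
        print(f"step={step:2d} staleness={staleness} action={action:.3f}")
```

In this sketch the staleness of the semantic condition grows between refreshes and resets when the slow module reports back; that gap is the temporal mismatch that historical action conditioning and time-misalignment training are meant to compensate for. A real deployment would presumably run the understanding module in a separate thread or process rather than inline.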