Vision-language-action (VLA) models have demonstrated exceptional performance in natural language-driven perception and control. However, their high computational cost poses significant efficiency challenges, particularly on resource-constrained edge platforms in real-world deployments. Moreover, because the stages of a VLA pipeline (observation, action generation, and execution) run sequentially, each waiting for the preceding stage to complete, the system suffers from frequent halting and high latency. To address this, we conduct a systematic analysis of the challenges to fast and fluent generation and propose StreamingVLA, which enables VLAs to parallelize asynchronously across stages in a "streaming" manner. First, we eliminate the reliance on action chunking and adopt action flow matching, which learns the trajectory of action flows rather than denoising chunk-wise actions; this overlaps the latency of action generation and execution. Second, we design an action-saliency-aware adaptive observation mechanism that overlaps the latency of execution and observation. Without sacrificing performance, StreamingVLA achieves a $2.4\times$ latency speedup and reduces execution halting by $6.5\times$, improving the fluency of execution.
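The latency argument can be illustrated with a minimal sketch. It is not the StreamingVLA implementation: the function names and stage durations below are hypothetical, and the overlapped case assumes an idealized steady state in which all three stages run fully concurrently, so the slowest stage alone sets the cycle time.

```python
def sequential_cycle(observe, generate, execute):
    """Each stage waits for the previous one, so the stage times add up."""
    return observe + generate + execute


def overlapped_cycle(observe, generate, execute):
    """Idealized steady state with all stages running concurrently:
    the slowest stage determines the time per cycle."""
    return max(observe, generate, execute)


# Hypothetical stage durations in milliseconds, chosen only for illustration;
# they are not measurements from the paper.
stages = {"observe": 20, "generate": 60, "execute": 40}

seq = sequential_cycle(**stages)
par = overlapped_cycle(**stages)
print(f"sequential: {seq} ms, overlapped: {par} ms, ratio: {seq / par:.1f}x")
```

In practice the stages depend on one another's outputs, so real overlap is partial and the achievable gain is smaller than this upper bound suggests.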