Diffusion-based vision-language-action models (dVLAs) are promising for embodied intelligence, but the high latency of full inference fundamentally limits their real-time deployment. We propose Realtime-VLA FLASH, a speculative inference framework that eliminates most full-inference calls during replanning. It combines a lightweight draft model, parallel verification via the main model's Action Expert, and a phase-aware fallback mechanism that reverts to the full inference pipeline when needed. This design enables low-latency, high-frequency replanning without sacrificing reliability. On LIBERO, FLASH largely preserves task performance while replacing many 58.0 ms full-inference rounds with speculative rounds as fast as 7.8 ms, lowering task-level average inference latency to 19.1 ms (a 3.04x speedup). We additionally demonstrate FLASH on real-world conveyor-belt sorting, highlighting its practical value for latency-critical embodied tasks.
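The draft, verify, and fallback control flow described above, together with the latency arithmetic, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `replan_step`, `draft`, `verify`, `full_inference`, and `force_full` are hypothetical names, and the phase-aware fallback is reduced to a single boolean flag because the abstract does not specify how it is triggered. The fraction estimate assumes every round costs either exactly 58.0 ms or exactly 7.8 ms, which the abstract does not claim.

```python
from __future__ import annotations

from typing import Callable, Sequence

Action = Sequence[float]

FULL_MS = 58.0  # full-inference round latency, from the abstract
SPEC_MS = 7.8   # fastest speculative round latency, from the abstract
AVG_MS = 19.1   # task-level average inference latency, from the abstract


def replan_step(
    obs: object,
    draft: Callable[[object], Action],
    verify: Callable[[object, Action], bool],
    full_inference: Callable[[object], Action],
    force_full: bool = False,
) -> tuple[Action, str]:
    """One replanning round. All callables are hypothetical stand-ins."""
    if not force_full:
        # Cheap path: the draft model proposes actions and the verifier
        # (standing in for the Action Expert) accepts or rejects them.
        proposal = draft(obs)
        if verify(obs, proposal):
            return proposal, "speculative"
    # Fallback: the draft was rejected, or the caller flagged this phase
    # as needing the full pipeline.
    return full_inference(obs), "full"


def implied_speculative_fraction(
    full_ms: float, spec_ms: float, avg_ms: float
) -> float:
    """Solve avg = p * spec + (1 - p) * full for p (two-latency assumption)."""
    return (full_ms - avg_ms) / (full_ms - spec_ms)


speedup = FULL_MS / AVG_MS
fraction = implied_speculative_fraction(FULL_MS, SPEC_MS, AVG_MS)
```

Here `speedup` rounds to the reported 3.04x. Under the two-latency assumption, `fraction` comes out at roughly 0.77. Because 7.8 ms is the fastest speculative round, slower speculative rounds would push the required share higher, so this reads as a rough lower bound rather than a figure from the paper.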